 |
Computing CRCs in Parallel
The CRC is the best way to guage data comm. Here's how to build one in an FPGA.
Published in Circuit Cellar Ink, September, 1995.
 |
For hints, tricks and ideas about better ways to build embedded systems, subscribe to The Embedded Muse, a free biweekly e-newsletter. No hype, just down to earth embedded talk. 23,000 other engineers subscribe. It takes just a few seconds (all we need is your email address, which is shared with absolutely no one) to subscribe to the Embedded Muse. |
The CRC (Cyclic Redundancy Check) is a sophisticated checksum
that has long been the most common means of testing data for correctness.
Every sector on a disk has a CRC of the data stored, so the operating
system will be aware of data dropouts. A CRC doesn't identify
which byte is in error, but it pretty much guarantees that you'll
be alerted to at least single bit errors.
All CRCs are binary polynomials that are divided into the data
stream. Although the definition and selection of the CRC is quite
complex, its use is not.
The most common CRC polynomial is the CCITT form used by IBM's
SDLC protocol. It is of the form:
X**16 + X**12 + X**5 + 1
This means that the input data stream (say, the data being written
to the disk drive) is exclusive ORed into a 16 bit shift register
that has feedback terms at the bit locations with coefficients
in the formula, one bit at a time. Each register bit is a function
of the CRC to that point, the input data, and the feedback taps.
In most systems data is transferred as a serial bit stream. Floppy
and hard disks, as well as the newer optical disks, all write
a single bit at a time to the medium. Modem applications are also
bit oriented. This is fortunate, since the CRC is particularly
well suited to serial data transfers. Serial shift registers,
even with the feedback terms, are easy to implement in silicon.
Fairchild's 74F401 chip is a 14 pin package that will compute
CRCs on serial data using any of eight different polynomials.
Many floppy disk controllers automatically append a CRC to the
data stream. With this technology CRCs are almost trivial to add
to a system.
If the peripheral is a parallel device, the problem becomes much
more complex. Think about it: the CRC is computed by eight shifts
per byte, each shift resulting in the exclusive ORing of a number
of terms within the byte. After 8 of these operations the output
is far from obvious. How do you compute a CRC on 8 bits at once?
The quest for speed is bringing more parallel devices into the
mainstream. An obvious example is a large RAM disk. Most implementations
make the hardware look just like a hard disk, so the operating
system can handle the device without special drivers. Other peripherals
may be connected using SCSI, which almost always is designed to
emulate a disk. Whenever the operating expects to see a disk,
it will also expect a CRC.
How do you implement a CRC in a purely parallel interface? An
obvious approach is to convert the data to serial, compute the
CRC, and convert it back to parallel. Although conceptually easy,
fast data transfers will require a mind-boggling clock rate (at
least 8 times the data rate), and a lot of hardware.
The new CRC (after a byte is shifted in) is a function of the
input data and the old CRC. Why not derive formulas for each of
the 16 bits of the CRC after the 8 new bits are shifted in? Then
all 16 CRC bits can be computed in a single clock cycle.
Deriving the formulas is conceptually easy but rather tedious.
If the input data is b0, b1, ... b7, and the current CRC is a0,
a1, ... a15, then compute the new CRC after one bit arrives. Iterate
seven more times, using new values of a0, a1, ... a15 each time.
You'll get 16 new equations each time a new bit is shifted in.
The last set of 16 is the result; these define the values of each
bit after an entire byte is CRCed. Apply these 16 equations to
a byte of data to compute a new CRC in a single cycle.
It turns out that a number of terms are common to many of the
16 equations. To simplify the algorithm the common terms are identified
as I, J, K, L, M, N, O, and P. Figure 1 shows a program written
in Turbo-C to compute the CRC using the derived formulas.
This program looks horrible! It is far more cumbersome and complicated
than a loop to do the normal serial CRC calculation. Why go through
all of this grief for so little reward?
The form of the program closely resembles that of the equations
needed to define a Programmable Logic Device (PLD). In other words,
there is a very close equivalence between the code and a hardware
implementation of the algorithm. Fortunately, the PLD version
can operate in a single clock cycle, and is much more ascetically
appealing than its awkward software cousin.
The PLD
For the uninitiated, a PLD is a sort of "super PAL".
Based on EPROM technology, the PLD is a device that can be programmed
with a user's equations. Like an EPROM, a fairly simple device
programmer is used to load the formulas.
PLDs are defined in terms of "macrocells". Every macrocell
is a multiple-input OR gate optionally connected to a flip-flop.
Each of the OR gate inputs is an extremely wide AND gate; any
or all of the PLD pins (and their inversions) can be connected
to the AND gates.
You can specify the type of flip-flop; typically D, T, JK, and
SR versions are all available. Or, you can disable the register
altogether, so the macrocell becomes a big combinatorial gate
structure.
A PLD is therefore a very large collection of AND and OR gates
with some internal registers also thrown in. The connection of
these components is up to the user; you write a set of boolean
equations that makes the PLD implement the a circuit you need.
Intel's 5C090 (equivalent to Altera's EP900) is a 40 pin PLD with
24 macrocells. It contains enough logic to completely implement
the CRC algorithm in parallel.
Putting the CRC in a PLD
Examining the program, it quickly becomes apparent that the intermediate
I through P terms must be computed before any of the a0 - a15
outputs, but once I-P are known, then all 16 a0 - a15 terms could
be computed simultaneously. We should therefore assign the I -
P terms to combinatorial outputs in the PLD. The a0 to a15 terms
(the CRC itself) are assigned to registered outputs. In this case
the current CRC is needed as part of the new one; therefore the
flip-flops' output will be fed back into the equation matrix.
It's pretty easy to translate the Turbo-C program into PLD equations.
This was my intention when writing the code; the real purpose
of the program is to provide a simulation of the CRC function.
Figure 2 shows the file that defines the PLD.
For those unfamiliar with PLD design, the NETWORK section of the
PLD file defines the nature of each of the device's pins. b0 to
b7 are input bits. I to P are combinatorial outputs. The a0 to
a15 (CRC) terms are defined as RORF (Registered Output, Registered
Feedback). The terms are latched in D flip-flops, but the current
value (before the device is clocked) goes back into the equation
matrix.
Figure 3 shows the connection of the PLD to a computer's data
bus. Most data communications processors manipulate 8 bit data,
so an 8 bit data bus is shown. The input data is a byte, but the
output CRC is 16 bits. This PLD is designed to let us multiplex
the CRC onto the bus as two individual bytes. The input bits come
from the same data bus. b0 is thus tied to a0 and a8, b1 to a1
and a9, etc.
The LOCRC and HICRC inputs are used to dump the upper or lower
CRC byte onto the bus. Once the calculation is complete, the processor
asserts these inputs low (one at a time) and reads the two bytes.
Typically these inputs are connected as input port selects.
The PRESET input is used to initialize the CRC before any data
is transferred. The CCITT specification requires that the CRC
be initialized to all ones. When asserted, PRESET will load a
one into all of the a0-a15 terms after the next clock cycle.
Be warned! Your interface circuit must assert clock when preset
is also low. Clock is high-edge triggered.
To use the device, first initialize the CRC by asserting PRESET
low and driving clock high. Then transfer data one byte at a time.
For each byte, drive clock high once the input data is stable,
and after the data has been present long enough to let the I to
P terms propagate (typically 100 nsec). The PLD will compute a
new value for the CRC and load it in the a0-a15 registers when
clock goes high. After a block is transferred, the CRC can then
be read by the computer.
This is a pretty complex series of equations. How can you be sure
the circuit works correctly? One way is to compare the CRC to
known good values after specific data is transferred. Table 1
shows CRC values for an input data stream of successive zeroes.
You can check the CRC after each clock against this table.
The CCITT polynomial has a self-checking property. Once a stream
of data has gone through the CRC calculator, you can supply the
one's complement of the resulting CRC to the PLD and always get
F0B8 as the new value. Always. This serves as a sanity check when
developing hardware, and as a quick way of verifying data against
a previously computed CRC during read of a device. Be sure to
transfer the low part of the CRC first, and then the high byte
when making this test.
Summary
The PLD lets you compute a CRC in parallel using only a single
IC package. It can support a data rate of at least 10 Mhz, even
using relatively slow devices. Although the equations are messy
and tedious, you can use the ones presented here as a "cookbook"
solution.
-----------------------------------------------------------
/* compute a parallel CRC */
#include <stdio.h>
int I,J,K,L,M,N,O,P;
int b0,b1,b2,b3,b4,b5,b6,b7;
int a0,a1,a2,a3,a4,a5,a6,a7,a8,a9,a10,a11,a12,a13,a14,a15;
int iter;
main()
{
int k0,k1,k2,k3,k4,k5,k6,k7;
/* Preset CRC to all ones - CCITT rules */
a0=a1=a2=a3=a4=a5=a6=a7=a8=a9=a10=a11=a12=a13=a14=a15=1;
/* Compute a number of CRCs with input data of all zeroes */
b0=b1=b2=b3=b4=b5=b6=b7=0;
for(iter=0; iter<10; ++iter)crc();
/* Now feed ones complement of the CRC back into calculation;
result should be F0B8 */
b0=!a0; b1=!a1; b2=!a2; b3=!a3; b4=!a4; b5=!a5; b6=!a6; b7=!a7;
k0=!a8; k1=!a9; k2=!a10; k3=!a11; k4=!a12; k5=!a13; k6=!a14; k7=!a15;
crc();
b0=k0; b1=k1; b2=k2; b3=k3; b4=k4; b5=k5; b6=k6; b7=k7;
crc();
}
crc()
{
I=(b3 & !a3) | (!b3 & a3);
J=(b2 & !a2) | (!b2 & a2);
K=(b1 & !a1) | (!b1 & a1);
L=(b0 & !a0) | (!b0 & a0);
M= b7 & !a7 & !I
| !b7 & !a7 & I
| !b7 & a7 & !I
| b7 & a7 & I;
N= b6 & !a6 & !J
| !b6 & !a6 & J
| !b6 & a6 & !J
| b6 & a6 & J;
O= b5 & !a5 & !K
| !b5 & !a5 & K
| !b5 & a5 & !K
| b5 & a5 & K;
P= b4 & !a4 & !L
| !b4 & !a4 & L
| !b4 & a4 & !L
| b4 & a4 & L;
a7=(a15 & !P) | (!a15 & P);
a6=(a14 & !I) | (!a14 & I);
a5=(a13 & !J) | (!a13 & J);
a4=(a12 & !K) | (!a12 & K);
a3= a11 & !L & !M
| !a11 & !L & M
| !a11 & L & !M
| a11 & L & M;
a2=(a10 & !N) | (!a10 & N);
a1=(a9 & !O) | (!a9 & O);
a0=(a8 & !P) | (!a8 & P);
a15=M;
a14=N;
a13=O;
a12=P;
a11=I;
a10=(J & !M) | (!J & M);
a9= (K & !N) | (!K & N);
a8= (L & !O) | (!L & O);
printf("\nCRC= %x%x%x%x %x%x%x%x %x%x%x%x %x%x%x%x",
a15,a14,a13,a12,a11,a10,a9,a8,a7,a6,a5,a4,a3,a2,a1,a0);
}
FIGURE 1: PROGRAM TO COMPUTE CRC IN PARALLEL
-----------------------------------------------------------
Designer: Jack G. Ganssle
Company: Softaid, Inc.
Date: March 17, 1989
Number: 1
Revision:
EPLD: 5C090
Comments: This file computes a CCITT-SDLC CRC in parallel
OPTIONS: TURBO=ON
PART: 5C090
INPUTS: b0@2,b1@3,b2@4,b3@17,b4@18,b5@19,b6@22,b7@23,preset@38,
hicrc@37,locrc@24,clk1@1,clk2@21
OUTPUTS:a0@5,a1@7,a2@9,a3@11,a4@13,a5@15,a6@25,a7@27,
a8@6,a9@8,a10@10,a11@12,a12@14,a13@16,a14@26,a15@28,
I@32,J@31,K@30,L@29,M@36,N@33,O@34,P@35
NETWORK:
b0=INP(b0)
b1=INP(b1)
b2=INP(b2)
b3=INP(b3)
b4=INP(b4)
b5=INP(b5)
b6=INP(b6)
b7=INP(b7)
preset=INP(preset)
losel=INP(locrc)
hisel=INP(hicrc)
clk1=INP(clk1)
clk2=INP(clk2)
a0,a0=RORF(a0d,clk1,gnd,gnd,lo)
a1,a1=RORF(a1d,clk1,gnd,gnd,lo)
a2,a2=RORF(a2d,clk1,gnd,gnd,lo)
a3,a3=RORF(a3d,clk1,gnd,gnd,lo)
a4,a4=RORF(a4d,clk1,gnd,gnd,lo)
a5,a5=RORF(a5d,clk1,gnd,gnd,lo)
a6,a6=RORF(a6d,clk2,gnd,gnd,lo)
a7,a7=RORF(a7d,clk2,gnd,gnd,lo)
a8,a8=RORF(a8d,clk1,gnd,gnd,hi)
a9,a9=RORF(a9d,clk1,gnd,gnd,hi)
a10,a10=RORF(a10d,clk1,gnd,gnd,hi)
a11,a11=RORF(a11d,clk1,gnd,gnd,hi)
a12,a12=RORF(a12d,clk1,gnd,gnd,hi)
a13,a13=RORF(a13d,clk1,gnd,gnd,hi)
a14,a14=RORF(a14d,clk2,gnd,gnd,hi)
a15,a15=RORF(a15d,clk2,gnd,gnd,hi)
L,L=COIF(Ld,)
K,K=COIF(Kd,)
J,J=COIF(Jd,)
I,I=COIF(Id,)
P,P=COIF(Pd,)
O,O=COIF(Od,)
N,N=COIF(Nd,)
M,M=COIF(Md,)
EQUATIONS:
lo=/losel;
hi=/hisel;
Id=(b3 * /a3) + (/b3 * a3);
Jd=(b2 * /a2) + (/b2 * a2);
Kd=(b1 * /a1) + (/b1 * a1);
Ld=(b0 * /a0) + (/b0 * a0);
Md= b7 * /a7 * /I
+ /b7 * /a7 * I
+ /b7 * a7 * /I
+ b7 * a7 * I;
Nd= b6 * /a6 * /J
+ /b6 * /a6 * J
+ /b6 * a6 * /J
+ b6 * a6 * J;
Od= b5 * /a5 * /K
+ /b5 * /a5 * K
+ /b5 * a5 * /K
+ b5 * a5 * K;
Pd= b4 * /a4 * /L
+ /b4 * /a4 * L
+ /b4 * a4 * /L
+ b4 * a4 * L;
a7d=(a15 * /P) + (/a15 * P)
+ /preset;
a6d=(a14 * /I) + (/a14 * I)
+ /preset;
a5d=(a13 * /J) + (/a13 * J)
+ /preset;
a4d=(a12 * /K) + (/a12 * K)
+ /preset;
a3d= a11 * /L * /M
+ /a11 * /L * M
+ /a11 * L * /M
+ a11 * L * M
+ /preset;
a2d=(a10 * /N) + (/a10 * N)
+ /preset;
a1d=(a9 * /O) + (/a9 * O)
+ /preset;
a0d=(a8 * /P) + (/a8 * P)
+ /preset;
a15d=M
+ /preset;
a14d=N
+ /preset;
a13d=O
+ /preset;
a12d=P
+ /preset;
a11d=I
+ /preset;
a10d=(J * /M) + (/J * M)
+ /preset;
a9d= (K * /N) + (/K * N)
+ /preset;
a8d= (L * /O) + (/L * O)
+ /preset;
END$
FIGURE 2: PLD FILE FOR A PARALLEL CRC
----------------------------------------------------------
Input data Output CRC
preset FFFF
0 0F87
0 F0B8
0 3933
0 0321
0 3088
77 0F48
CF F0B8
(Note that feeding one's complement of CRC into the PLD yields
F0B8)
TABLE 1: SAMPLE CRC VALUES
|
| |
The next public class will be in the Fall of 2010. But you can bring this class to your company! Click here to find how we can come to your facility and present the class.
Jack will be speaking at the Embedded Systems Conference in San Jose April 26-29. |
|
| |
 |