What Happens at 100 MHz?

As CPU speeds climb towards infinity, our debugging strategies must change. Here are some ideas circa 1995.


By Jack Ganssle

In "From the Earth to the Moon" Jules Verne painted a wonderful image of the battle between canon manufacturers and armor makers. Armorers always operate in a catch-up mode, trying desperately to build something resistant to the newest speedy shell. When a group builds a gigantic canon to shoot a man to the moon the armourers all but revolt, complaining that nothing could keep up with this latest technological advance.

Every time I pick up an electronics magazine I'm reminded of this story. The chip vendors announce in glittering prose their latest masterpiece of silicon speediness, mostly without a comment about how the poor designer should develop code for it. In real life I run a business that makes emulators, and am left feeling a bit like those poor armorers of old.

At some point - 100, 200, or surely by 300 MHz - the tools we have relied on for all these years just won't keep up.

Consider my old favorite, the in-circuit emulator (ICE). Some sort of hardware replaces the CPU in the product you're working on. Electronics - a lot of electronics - uses either the same sort of CPU, or a special "bond-out" version, to control your system.

At reasonable speeds this is probably the best way to get visibility into your code and hardware. The ICE is intrusive, in that it can toggle bits and exercise your memory and I/O independent of the execution of your code. In other words, you can stop the code and examine the state of every part of the system. Conversely, a decent ICE runs your code non-intrusively. It's as if the emulator is the world's most expensive microprocessor, running your code exactly as it should. Perhaps breakpoints are pending or trace is armed; these debug features either have no impact on the code, or at least none till they take effect.

This dual identity - intrusive and non-intrusive behavior - comes from two characteristics of the emulator. The first is bus isolation: the microprocessor in the ICE is physically separated from your target system by a data bus buffer. Turn it on, and every bus cycle generated by the CPU on the ICE's pod is mirrored to your own electronics. The emulator does this to access a target resource.

Turn it off, and the CPU can run ICE-specific code that your target never sees. Perhaps the emulator's hardware/firmware is extracting the contents of the registers so you can see these in a window on your terminal. Maybe it's getting ready to read one byte from your target.

The second characteristic is simply the sad fact that features (breakpoints, trace, and the like) have a cost. An ICE uses quite a bit of electronics that have to be mounted somewhere, somehow. As you're not likely to design an emulator into your product, we emulator vendors design them as instruments you plug into the CPU socket.

At low speeds there's no problem. Once the clock rate climbs to the crazy levels we're seeing on the horizon, though, the physical size of the system - a pod, a connection via (perhaps) an adapter to your target - becomes large compared to the distance a signal can travel in one bus cycle. Every connector, connection, and PC board track imposes a significant additional delay as the signal propagates from your target up to the guts of the ICE.

Signals propagate at around 2 nsec per foot in wire. We just can't push them faster. Perhaps with new understandings of quantum mechanics we'll one day exploit the "spooky action at a distance" (Einstein's words) beyond-lightspeed collapse of the quantum wave function to transmit data between the target and ICE infinitely fast. Since this seems unlikely in our lifetimes, we'll have to bow to reality and develop alternative approaches.

The previously-mentioned data bus buffer is a source of headaches as well. The very fastest devices need over 2 nsec to switch directions. Add perhaps another 2 nsec of propagation time through the connections, and 40% of the cycle of a 100 MHz one T-state system is eaten up just moving the data between target and ICE.

Things degenerate quickly at higher speeds. A 200 MHz single T-state CPU has a 5 nsec bus cycle. Though one can argue that cache designs reduce external bus rates to more reasonable levels, few designers are willing to give up breakpoints in cache, or real time trace, just to make life easier for the tools.
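
To put some numbers on this, here's a back-of-the-envelope calculation in C. The 2 nsec figures are just the rough values quoted above, not data sheet specs:

    #include <stdio.h>

    /* Rough ICE delay budget from the discussion above, in nsec. */
    #define BUFFER_NS 2.0   /* data bus buffer direction switch     */
    #define PROP_NS   2.0   /* connectors, adapter, PC board tracks */

    int main(void)
    {
        double mhz;

        for (mhz = 50.0; mhz <= 300.0; mhz += 50.0) {
            double cycle_ns = 1000.0 / mhz;  /* one T-state bus cycle */
            double pct = 100.0 * (BUFFER_NS + PROP_NS) / cycle_ns;

            printf("%3.0f MHz: %5.1f nsec cycle, overhead %5.1f%%\n",
                   mhz, cycle_ns, pct);
        }
        return 0;
    }

At 100 MHz this prints the 40% figure; by 250 MHz the overhead alone swallows the entire bus cycle.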

So, like Chicken Little, I'm predicting that the roof will eventually fall in on high speed embedded design. There are some options. First, though, it's important to consider what very high speed embedded systems imply.

More Speed Now!

I'm told that a number of embedded apps will always push the performance envelope. V.34 modems apparently already need 30+ MHz 8 bit CPUs. Disk drives will eat every bit of performance available, as the ability to suck data from the platter in real time reduces disk cache memory sizes... and memory is expensive.

Intel preaches the virtues of raw horsepower to reduce system costs by eliminating the need for external DSP chips in fax machines, cellular phones, and other communications devices.

Real time compression and encryption need ever faster processors, especially as data comm rates continue to increase. Of course, if the Feds have their way, we'll always be limited to inadequate 64 bit keys (they'll have a copy tucked away, just in case), which won't demand as much CPU performance as the 1024 bitters so many of us use now via PGP.

Inexpensive 16 bit CPUs at speeds of 40 MHz are available today. 8 bitters passed 20 MHz years ago. Raw clock rate specifications, though, mean little.

Though the chip vendors are excruciatingly honest about specifying their clocks as real CPU bus rate, many designers still don't understand that the crystal frequency may be twice the bus rate. Most processors divide the crystal by two. That 40 MHz crystal, then, may be driving your processor at a more sedate 20 MHz. It's hard to build very high frequency crystals, so many of the speediest CPUs divide the input by 1. Some don't accept a crystal at all. On the 386/486/Pentium, for example, a single clock pin accepts only a clean TTL waveform generated by an external oscillator.

Others multiply the input using a phase locked loop. Many of Motorola's chips can use a wristwatch crystal - 32.768 kHz - to run at their full rated 16 MHz. This has two advantages. Watch crystals are cheap and very small. Even better, the 32.768 kHz input is ideal for tracking the time of day in the real time clock modules included on many of these parts.
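
The arithmetic behind both schemes is trivial, but worth seeing once. The x512 multiplier below is illustrative - it's roughly what's needed to reach 16 MHz from a watch crystal - so check your part's data sheet for the real factor:

    #include <stdio.h>

    int main(void)
    {
        /* Divide-by-two: the crystal runs at twice the bus rate. */
        printf("40 MHz crystal / 2 = %.2f MHz bus\n",
               40.0e6 / 2.0 / 1.0e6);

        /* PLL multiply: a watch crystal is multiplied up.
         * The x512 factor is illustrative only. */
        printf("32.768 kHz crystal x 512 = %.2f MHz bus\n",
               32768.0 * 512.0 / 1.0e6);
        return 0;
    }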

Count on continued confusion when comparing raw bus rates. Only the newest and fastest parts do anything useful in a single clock cycle. The 68332, for instance, needs 3 clocks to read anything, including an instruction, from memory. The 186 needs 4. Zilog's Z180 is a bigger/faster/better version of the Z80; it uses a 3 T-state bus - one less than the Z80 - immediately bettering performance by 25%.
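
If you want to compare parts yourself, access time is simply the number of T-states times the clock period. The clock rates below are illustrative, not data sheet figures:

    #include <stdio.h>

    /* Access time = T-states per bus cycle x clock period. */
    static void access_time(const char *cpu, double mhz, int t_states)
    {
        printf("%-6s %5.1f MHz, %d T-states: %6.1f nsec per access\n",
               cpu, mhz, t_states, t_states * 1000.0 / mhz);
    }

    int main(void)
    {
        access_time("68332", 16.0, 3);
        access_time("186",   16.0, 4);
        access_time("Z80",    8.0, 4);
        access_time("Z180",   8.0, 3);  /* one T-state fewer than the Z80 */
        return 0;
    }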

One of the promises of RISC is to simplify instruction sets to the point where each instruction can run in one cycle. CISC machines are evolving in this direction as well, as we see with the Pentium and other high-end speed demons.

A one T-state processor running at 50 MHz needs only 20 nsec to read memory, assuming there are no setup/hold time requirements. More common embedded CPUs require 2, 3 or 4 T-states, increasing basic machine timing to 40 to 80 nsec. Clearly, when the holy grail of performance is the only consideration, a one T-state machine is the way to go.

There's a bit of a dirty secret though: few of us can afford large amounts of zero wait state memory when the entire machine cycle races by in 20 nsec (or 10 nsec, at 100 MHz). The solution is cache, which is a bit of very expensive but very fast RAM. All off-cache accesses use one or more wait states, greatly slowing the system down (a single wait state on a 1 T-state machine halves system throughput on that cycle).
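
A quick sanity check on that claim, using the 50 MHz one T-state machine from above - effective access time is simply (T-states + wait states) times the clock period:

    #include <stdio.h>

    int main(void)
    {
        double period_ns = 1000.0 / 50.0;   /* 50 MHz clock */
        int waits;

        for (waits = 0; waits <= 2; waits++) {
            /* 1 T-state plus any wait states per access */
            double access_ns = (1 + waits) * period_ns;
            printf("%d wait state(s): %4.0f nsec/access, %4.1f M accesses/sec\n",
                   waits, access_ns, 1000.0 / access_ns);
        }
        return 0;
    }

One wait state cuts the 50 million accesses per second to 25 million - throughput on that cycle is indeed halved.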

Cost sensitive embedded systems simply must minimize memory expenses. Those extra T-states start to look pretty attractive when you've got a tiny budget for computer hardware. Thus, memory cost has been a limiting factor on the speed of most embedded systems.

AMD pulled a neat trick with their Am186EM, which will give zero wait state accesses to 70 nsec RAMs at CPU bus speeds of 40 MHz. It's not the fastest part in town, but the complete system cost sure is attractive.

Speed Cometh

Despite the memory problem, the very fastest embedded systems are now using, or will soon use, single T-state processors at outrageous clock rates. The applications listed above all need the best performance, and demand low costs to boot. Once I would have said these were the pathological cases; the ones with little impact on what most of the industry is doing. Perhaps this is changing.

Very high speed, very high volume, embedded systems are now becoming possible due to the excess fab capacity of the chip vendors. Many are actively designing CPUs that will be used only by a single customer - say, a laser printer sold in enormous quantities via the discount mail order computer houses. Speed, as a replacement for complicated electronics or memory, in high volume applications is starting to make economic sense.

It's not entirely clear how we'll deal with expensive memory. Clearly on-board cache will be ever more common. Just as clearly, memory costs do follow a descending curve, though it always seems to lag the needs of the speediest processors. Try buying very fast static RAM today - it's practically all on allocation due to the enormous demands for fast external cache for the PC industry.

Since the rest of the industry seems to live with the spin-offs (or, perhaps the cast-offs) of the racehorses, I suspect that as time goes by more and more embedded systems, even those produced in medium to low volumes, will use these very fast parts now being developed for single applications.

"Wait a minute," you exclaim, "4 and 8 bit CPUs account for the lion's share of the embedded processors sold each year. Most run at pathetically slow speeds."

True. But take a closer look at that 4 bit market. Most are custom or semi-custom parts tuned to a particular application - an appliance or TV remote control. These parts already herald some of the problems we're bound to see. Though speed isn't an issue, tools certainly are. There cannot be a healthy, competitive tool market for a processor used by a single customer. Engineers are developing at the low end of the market using heroic efforts, not the latest in technology aids.

Yes - the 8051, Z80, 6805, and other workhorses of the 8 bit arena will never clock at 100 MHz rates, so a significant chunk of low end systems will always be slow. I simply contend that time will create more applications for embedded systems; that these will require ever more horsepower; and that many will run at breakneck speeds or with custom parts that have no decent tool support.

As designers, and as an industry, it's time to start coming up with development strategies for the future. We're fast approaching the end of our ability to tune and drag current development techniques along with the evolving direction of modern microprocessors.

Spoiled Rotten?

Maybe we've been spoiled. Real time trace, hardware performance analysis, complex breakpoints - all contribute to easing the pain of development. All help us get our product to market faster.

Yet, very few programmers use these sorts of tools. Most developers write non-embedded code that runs on PCs and workstations. Somehow they manage without the cool stuff we demand. Though most of these applications don't service interrupts or respond to real time events, some do. This, folks, is the future of our industry.

Don't get me wrong: I believe that programmers are a very expensive resource; time to market is critical. Any tool that increases efficiency is worth its weight in gold. Too many companies short-change developers with lousy equipment and noisy cubicles while doling out millions in salaries.

Raw CPU speed alone will make traditional embedded tools an impossibility. On-board cache, semi-custom ASICs with on-board processors, and superscalar designs will seal their fate.

Motorola's Background Debug Mode (BDM) is a glimpse into the future. All of their more recent parts include a special serial port used only for debugging. Transistors are cheap - it makes sense to integrate a few extra onto the processor as a special debug port.

Similarly, many other vendors are putting variants of JTAG (IEEE 1149.1) ports onboard their fastest CPUs. Like the BDM these are all special serial interfaces just for the sake of debugging code (and, perhaps, in-circuit test of production boards).

A debugger on-board the chip eliminates all speed issues. It's cache-independent. Even when the CPU is hidden in a huge ASIC, if just a few pins come out for the serial debugger, then designers will have some ability to troubleshoot their code.

JTAG/BDM lets you set simple breakpoints, single step, dump and change memory and I/O... in short, everything you can do with a normal PC design environment, like Microsoft's Visual C++.
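
To give a flavor of what a dedicated debug serial port looks like from the host side, here's a rough sketch of shifting one BDM-style packet; Motorola's BDM moves 17-bit words (a control bit plus 16 data bits) over a three-wire interface. The pin-access routines are hypothetical placeholders for whatever pod or printer-port hardware you have, and details like clock polarity vary from part to part - this is the idea, not a driver:

    /* Hypothetical host-side pin access - not a real API. */
    extern void set_dsclk(int level);   /* drive the debug serial clock */
    extern void set_dsi(int level);     /* drive data toward the target */
    extern int  get_dso(void);          /* sample data from the target  */

    /* Shift 'out' to the target MSB first; return the word shifted back. */
    unsigned long bdm_transfer(unsigned long out)
    {
        unsigned long in = 0;
        int bit;

        for (bit = 16; bit >= 0; bit--) {   /* 17 bits per packet */
            set_dsi((int)((out >> bit) & 1));
            set_dsclk(1);                   /* target samples DSI */
            in = (in << 1) | (unsigned long)get_dso();
            set_dsclk(0);
        }
        return in;
    }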

The downside is what we'll lose. Real time trace is all but out - you'll never find a chip with megabytes of fast on-board RAM just to make debugging easier. Yes, some chip vendors are experimenting with small on-board trace memories and other clever approaches to give some sort of real time visibility to the code, but all of these at best are compromises. Real performance analysis, overlay RAM, complex breakpoints, and all the rest will be history.

And so, I predict that high speed embedded design will get even harder, limited by physical properties of the chips. We'll have all of the debug capability used by non-embedded programmers, but will lose the neat stuff we rely on for real time design.

The "good" news (for troops in the trenches) - increased job security. Development will be harder and take longer, requiring more skilled people.

The real battles will heat up in software tools. Source debuggers that drive the JTAG port will be our primary tool. New, cool ways of tracking real time events will be invented. Perhaps we'll instrument our code with calls to track execution time. Simulation may finally come into its own as an adjunct to conventional debugging.
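
Instrumenting code can be as simple as bracketing a routine with macros that toggle a spare output bit, then watching that bit on a scope or logic analyzer. The port address here is made up for illustration; point the macro at whatever spare I/O your system has:

    /* DEBUG_PORT is a made-up address - substitute a real spare port. */
    #define DEBUG_PORT   (*(volatile unsigned char *)0x40000)
    #define MARK_ENTER() (DEBUG_PORT |= 0x01)   /* raise the scope bit */
    #define MARK_EXIT()  (DEBUG_PORT &= 0xFE)   /* drop it again       */

    void timer_isr(void)
    {
        MARK_ENTER();
        /* ... the real time work whose duration we're measuring ... */
        MARK_EXIT();
    }

Measure the high time on a scope and you've got the routine's execution time, with essentially zero impact on the code.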

Stay tuned. Watch the market; things will continue to change. The cannon folks are pulling ahead of the armorers.