Design For Performance

Build a "fast enough" system.

By Jack Ganssle

Though my August article "Is Java Ready For Prime Time?" produced a number of interesting responses from readers, one of the more thoughtful came from Tom Mitchell:

 "I have just finished reading "Is Java Ready For Prime Time?" and it brought to mind a conversation I had recently with a co-worker. What we were discussing was the usefulness of "low-level" versus "high-level" technology. We were bemoaning the impression that lately it seems fewer technical people have much less interest in low-level details of system design and more interest in the "sexier" technologies, such as, HTML and Java. For example, how many programmers have sufficient knowledge of assembly language to use it in a program if they have a performance bottleneck? How many digital designers know what their VHDL synthesizer is creating and could they optimize it if they had too?"

 "My point is that performance seems to rarely be an issue. I recently sat in a meeting where a software "engineer" confidently declared to me that he didn't need to worry about the architecture of the machine his code would run on because "other people had taken care of the performance issues." At the time I was trying to convince him that a more modest system (albeit carefully optimized) than the one he was proposing was sufficient for the job we were working on."

 "Now it may just be me, but that seems to be a prevalent attitude amongst software developers. Don't worry about speed, the hardware guys will always come up with faster processors. I'm not talking about a software versus hardware issue because VHDL proponents take the same approach. Whatever happened to system design?"

 Right on, Tom! It sure seems that in most cases "system design" - a true analysis of hardware/software tradeoff issues - is non-existent. I suspect most of us have no idea if a system will be fast enough (whatever that means!) when we commit to a hardware design.

 There are a couple of reasons for this. As the industry matures, more and more people are pure hardware or pure software engineers. The birth of the embedded industry in the 70s was driven by hardware people who counted T-states and hand-crafted assembly code to get things done on impossibly small CPUs. Systems were small, so they could tolerate the impossible spaghetti code that resulted. Now it seems the hardware folks heave a design over the moat to the firmware crowd, who struggle with too little knowledge of how the system works, and sometimes even less knowledge of the tradeoffs that could have been made to tune system performance.

 The PC industry exacerbates the problem. The pursuit of desktop compute horsepower spills over into embedded systems. When everything from PC Week to Time Magazine to Playboy touts the virtues of 64 bit processing with zigabytes of cache and infinite DRAM, when we're all used to having (in effect) a mainframe on our desks, it's hard to get worked up about performance issues.

 FBI agents follow the money to understand the crime - not a bad metaphor for us. Consider the dynamics of the processor market: billions of 8 bit CPUs make it into small systems every year. Tens of millions of big chips ship in PCs. Those 8 bitters are selling for a pittance - many for under a buck. Massive processors go for a huge sum, with very high profit margins. Clearly the chip vendors have an interest in spreading the gospel of solving performance problems by tossing in more CPU horsepower. A conspiracy? Nah - just good business sense.

 Don't get me wrong: by all means use as much CPU as you need. Systems are indeed much more complex than two decades ago, and sometimes require a massive dose of horsepower. Do, though, use appropriate technology. Use a rational decision process to select just enough CPU for your system. Use a software design that complements the hardware, with optimized critical routines, designed for speed from the outset.

Software First

 In a perfect world you'd design the code first, and select the processor after understanding the code's constraints. Perhaps this happens on the planets circling Beta Pictoris. On Earth, though, the CPU choice is usually made based on a vague feeling of address space/performance requirements, tempered by inventory issues ("we use the part in all of our other products, so we'll use it again"), price ("the product must cost less than twelve bucks"), and expertise ("our team already understands the 8080, so we'll keep using it").

 Though we may justifiably complain about this injustice done by management to our design, in fact it's rare we understand the performance issues enough to make an informed processor decision. Just how fast will that 20,000 lines of C run on an 8051, or on a 68040? The compiler alone abstracts us from the hardware so efficiently that we generally have no idea if even the simplest for statement completes in a microsecond or a millisecond. Does it call a runtime routine? No one knows, without examining the generated code.
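 To see just how opaque this is, consider a one-line fragment. The sketch below assumes a typical 8 bit target with no hardware divide; the library routine named in the comment is illustrative, not any particular compiler's:

    #include <stdint.h>

    /* On most 8 bit compilers this innocent statement compiles to a
       call into the runtime library (something like __udiv32 - the
       name varies by vendor), since the CPU has no 32 bit divide
       instruction. Ask for an assembly listing and see for yourself. */
    uint32_t scale(uint32_t counts, uint32_t counts_per_unit)
    {
        return counts / counts_per_unit;
    }

 One line of C; possibly hundreds of T-states. Without looking at the output, you simply can't know.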

 Except in extreme circumstances, though, wading through the compiler's assembly listings is a bad idea. The virtue of C - or any other high level tool - is to insulate us from the details. Once we dive back into the tool's output we've all but abandoned the advantages the tool brings.

 Now, there's nothing wrong in developing an understanding of a tool's operation. For time-critical applications it might make sense to study your compiler. Run some sample code through it and see what it generates. Develop a short document with typical examples, so you know what to expect before writing thousands of lines of code. Then the compiler is more or less predictable, and you can pretty much understand it.
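 A sketch of such a test file, assuming nothing more than a compiler that can emit an assembly listing (the constructs shown are illustrative; pick ones typical of your own code):

    /* timing_samples.c - run this through the compiler with the
       assembly-listing option, then record the instruction and
       cycle counts for each construct in your reference document. */
    #include <stdint.h>

    volatile uint16_t port;     /* volatile, so nothing optimizes away */

    void samples(int16_t a, int16_t b, float x, float y)
    {
        int16_t i;

        if (a == b) port = 1;       /* integer compare and branch       */

        for (i = 0; i < 10; i++)    /* the simplest for statement       */
            port = (uint16_t)i;

        port = (uint16_t)(a * b);   /* multiply: one opcode, or a call? */
        port = (uint16_t)(x + y);   /* float add: almost surely a call  */
    }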

 It's a shame the vendors don't do this for us! Wouldn't it be nice to know that an integer compare eats 20 T-states? Or that a floating point add burns between 100 and 250 T-states?

 On complex processors even reading the assembly brings little information about raw execution speed, as caching, prefetching, and pipelining all conspire to radically alter instruction timing based on events too complex for us to understand. (This is the excuse used by many compiler vendors for not providing timing info).

 Yet most embedded systems use small 8 and 16 bit processors. Though a state-of-the-art processor dooms you to tremendous timing uncertainty, most of us deal with systems that are quite deterministic.

 A simple CPU has very predictable timing. Add a prefetcher or pipeline and timing gets fuzzier, but still is easy to figure within 10 or 20 percent. Cache is the wildcard, and as cache size increases determinism diminishes. Thankfully, today few small embedded CPUs have even the smallest amount of cache.

 Your first weapon in the performance arsenal is developing an understanding of the target processor. What can it do in one microsecond? One instruction? Five? Some developers use very, very slow clocks when not much has to happen - one outfit I know runs the CPU (in a spacecraft) at 8 KHz until real speed is needed. At 8 KHz they get maybe 1000 instructions per second. Even small loops become a serious problem. Understanding the physics - a perhaps fuzzy knowledge of just what the CPU can do at this clock rate - means the big decisions are easy to make.
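 The arithmetic behind that estimate is worth making a habit. A sketch, with an assumed average of eight clocks per instruction (a typical figure for classic 8 bitters - check your CPU's data book):

    /* Back-of-the-envelope physics for the 8 KHz spacecraft example:
         8000 clocks/sec / ~8 clocks per instruction = ~1000 instr/sec
       So a 50 instruction polling loop runs at only ~20 Hz, and any
       software delay loop measured in milliseconds is hopeless.     */
    #define CLOCK_HZ          8000UL
    #define CLOCKS_PER_INSTR  8UL    /* assumption - data book figure */
    #define INSTR_PER_SEC     (CLOCK_HZ / CLOCKS_PER_INSTR)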

 Estimation is one of engineering's most important tools.  Do you think the architect designing a house does a finite element analysis to figure the size of the joists? No! He refers to a manual of standards. A 15 foot unsupported span typically uses joists of a certain size. These estimates, backed up with practical experience, ensure that a design, while perhaps not optimum, is adequate.

 We do the same in hardware engineering. Electrical signals propagate at about one or two feet per nanosecond, depending on the conductor. It's hard to make high frequency first harmonic crystals, so use a higher order harmonic. Very small PCB tracks are difficult to manufacture reliably. All of these are ingredients of the "practice" of the art of hardware design. None of these are tremendously accurate: you can, after all, create one mil tracks on a board for a ton of money. The exact parameters are fuzzy, but the general guidelines are indeed correct.

 So too with software engineering. We need to develop a sense of the art. A 68HC16, at 16 MHz, runs so many instructions per second (plus or minus). With this particular compiler you can expect (more or less) this sort of performance under these conditions.

 Data, even fuzzy data, lets us bound our decisions, greatly improving the chances of success. The alternative is to spend months and years generating a mathematically precise solution - which we won't do - or to burn incense and pray, the usual approach!

 Models

 Another sense of the art we must cultivate is a range of well understood canned solutions. On comp.arch.embedded recently quite a few posters responded to a question about fast 8 and 16 bit CRC algorithms. I sucked several of the more interesting down and stuck them in my library.  Why re-invent the wheel every time? These algorithms may be big (in some cases), but are fast and predictable. Reading the code I get a sense of their speed. The first time I fire one up I'll get a quantifiable measure, one that I'll associate with that algorithm forever. Change CPUs, change clock rates, and the measures shift, but by more or less predictable amounts.
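 For the curious, here's the flavor of routine those posters were trading: a table-driven CRC-16 using the common CCITT polynomial (a representative sketch, not any particular poster's code). It spends 512 bytes on a lookup table to buy both speed and, just as valuable, identical timing for every byte processed:

    #include <stdint.h>
    #include <stddef.h>

    static uint16_t crc_table[256];

    /* Build the table once at startup (CCITT polynomial 0x1021). */
    void crc16_init(void)
    {
        int i, j;
        for (i = 0; i < 256; i++) {
            uint16_t c = (uint16_t)(i << 8);
            for (j = 0; j < 8; j++)
                c = (c & 0x8000) ? (uint16_t)((c << 1) ^ 0x1021)
                                 : (uint16_t)(c << 1);
            crc_table[i] = c;
        }
    }

    /* One lookup, one shift, two XORs per byte - fast, and exactly
       the same number of cycles no matter what the data looks like. */
    uint16_t crc16(const uint8_t *p, size_t len, uint16_t crc)
    {
        while (len--)
            crc = (uint16_t)((crc << 8) ^ crc_table[((crc >> 8) ^ *p++) & 0xFF]);
        return crc;
    }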

 Stealing algorithms is useful but admittedly deals with only a portion of an entire new system. You can plug the routines into your design, but there's always a ton of new code whose behavior may not be well understood.

 Experiment. Run portions of the code. Use a stopwatch - metaphorical or otherwise - to see how it executes. Buy a performance analyzer or simply instrument sections of the firmware (see my June article for ideas about measuring ISR performance with a scope) to understand the code's performance.
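 The stopwatch can be as crude as a spare output bit and an oscilloscope: raise the bit entering the routine, drop it leaving, and read the pulse width off the screen. A sketch - the port name and bit are placeholders for whatever spare pin your hardware happens to have:

    #include <stdint.h>

    extern volatile uint8_t P1;          /* or your compiler's SFR syntax */

    #define SCOPE_HIGH()  (P1 |= 0x01)   /* assumption: P1 bit 0 is free  */
    #define SCOPE_LOW()   (P1 &= (uint8_t)~0x01)

    void routine_under_test(void)
    {
        SCOPE_HIGH();          /* scope trace goes high...                */
        /* ... the code being measured ... */
        SCOPE_LOW();           /* ...and the pulse width is its run time  */
    }

 Trigger the scope on the rising edge and the execution time - minimum, maximum, and jitter - is right there on the display.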

 The first time you do this you'll think "this is so cool", and you'll walk away with a clear number: xxx microseconds for this routine. With time you'll develop a sense of speed. "You know, integer compares are pretty damn fast on this system." Later - as you develop a sense of the art - you'll be able to bound things. "Nah, there's no way that loop can complete in 50 microseconds."

 This is called experience, something that we all too often acquire haphazardly. We plan our financial future, we work daily with our kids on their homework, we even remember to service the lawnmower at the beginning of the season - yet we neglect to proactively improve our abilities at work.

 Experience comes from exposure to problems and from learning from them. A fast, useful sort of performance expertise comes from extrapolating from a current product to the next. Most of us work for a company that generally sells a series of similar products. When it's time to design a new one we draw from the experience of the last, and from the code and design base. Building version 2.0 of a widget? Surely you'll use algorithms and ideas from 1.0. Use 1.0 as a testbed. Gather performance data by instrumenting the code.

 Always close the feedback loop! When any project is complete, spend a day learning about what you did. Measure the performance of the system to see just how accurate your processor utilization estimates were. The results are always interesting and sometimes terrifying. If, as is often the case, the numbers bear little resemblance to the original goals, then figure out what happened, and use this information to improve your estimating ability. Without feedback, you work forever in the dark. Strive to learn from your successes as well as your failures.

 Track your system's performance all during the project's development, so you're not presented with a disaster two weeks before the scheduled delivery. It's not a bad idea to assign CPU utilization specifications to major routines during overall design, and then track these targets like you do the schedule. Avoid surprises with careful planning.
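 One low-tech way to keep that utilization number in view all through development: count idle-loop spins. A sketch, assuming a once-per-second timer interrupt exists and that the unloaded spin count was calibrated once on an idle system:

    #include <stdint.h>

    static volatile uint32_t idle_count;    /* spins this second   */
    static volatile uint32_t idle_last;     /* last second's total */

    #define IDLE_UNLOADED 100000UL  /* assumption: spin count calibrated
                                       once with no application load     */

    void main_loop(void)
    {
        for (;;) {
            idle_count++;
            /* ... dispatch pending work here ... */
        }
    }

    /* Called from a (hypothetical) once-per-second timer interrupt. */
    void one_second_tick(void)
    {
        idle_last  = idle_count;
        idle_count = 0;
        /* CPU load (%) = 100 - (100 * idle_last) / IDLE_UNLOADED */
    }

 Log idle_last at each build and you have a CPU-loading trend line to track right alongside the schedule.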

 A lot of projects eventually get into trouble by overloading the processor. This is always discovered late in development, during debugging or final integration, when the cost of correcting the problem is at its maximum. Then a mad scramble to remove machine cycles begins.

 We all know the old adage that 90% of the processor burden lies in 10% of the code. It's important to find and optimize that 10%, not some other section that will have little impact on the system's overall performance. Nothing is worse than spending a week optimizing the wrong routine!
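 Lacking a performance analyzer, a crude statistical profiler will find that 10% for you: tag each major routine with a region number, and let a periodic timer interrupt build a histogram of which tag is live when it fires. A sketch, assuming some periodic timer ISR is available:

    #include <stdint.h>

    #define N_REGIONS 8

    static volatile uint8_t  current_region;      /* set on entry to each routine */
    static volatile uint32_t samples[N_REGIONS];  /* ticks charged to each region */

    #define ENTER_REGION(n)  (current_region = (uint8_t)(n))

    /* Hypothetical periodic timer ISR: each tick charges one sample
       to whatever region the foreground code was executing.         */
    void timer_isr(void)
    {
        samples[current_region]++;
    }

    void update_display(void)      /* one of the tagged major routines */
    {
        ENTER_REGION(2);
        /* ... */
    }

 Let it run for a few minutes and the samples array points, statistically, straight at the busy 10%.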

 If you understand the design, if you have a sense of the CPU, you'll know where that 10% of the code is before you write a line. Knowledge is power.

 Learn about your hardware. Pure software types often have no idea that the CPU is actively working against them. I talked to an engineer lately who was moaning about how slow his new 386EX-based instrument runs. He didn't know that the 386EX starts with 31 wait states! and so had never reprogrammed it to a saner value.

 Conclusion

 In the 1970s Motorola sold a one bit CPU. Given enough time and address space it could do pretty much anything our embedded processors do today. Its performance was adequate for many applications, especially when coupled with appropriate software.

 If there was a procedure, a checklist, that we could follow that ensured fast enough code, I'd recommend we all slavishly follow it. There ain't, and it appears no such silver bullet will appear any time soon. Our only option is to design carefully, to measure system speed often, and to develop a sense of the art.