Design For Performance
Build a "fast enough" system.
Though my August article "Is Java Ready For Prime Time?"
produced a number of interesting responses from readers, one of the more
thoughtful came from Tom Mitchell:
"I have just finished reading "Is Java Ready
For Prime Time?" and it brought to mind a conversation I had recently with
a co-worker. What we were discussing was the usefulness of "low-level"
versus "high-level" technology. We were bemoaning the impression that
lately it seems fewer technical people have much interest in low-level
details of system design and more have an interest in the "sexier"
technologies, such as HTML and Java. For example, how many programmers have
sufficient knowledge of assembly language to use it in a program if they have a
performance bottleneck? How many digital designers know what their VHDL
synthesizer is creating, and could they optimize it if they had to?"
"My point is that performance seems to rarely be an
issue. I recently sat in a meeting where a software "engineer"
confidently declared to me that he didn't need to worry about the architecture
of the machine his code would run on because "other people had taken care
of the performance issues." At the time I was trying to convince him that a
more modest system (albeit carefully optimized) than the one he was proposing
was sufficient for the job we were working on."
"Now it may just be me, but that seems to be a
prevalent attitude amongst software developers. Don't worry about speed, the
hardware guys will always come up with faster processors. I'm not talking about
a software versus hardware issue because VHDL proponents take the same approach.
Whatever happened to system design?"
Right on, Tom! It sure seems that in most cases
"system design" - a true analysis of hardware/software
tradeoff issues - is non-existent. I suspect most of us have no idea if a system
will be fast enough (whatever that means!) when we commit to a hardware design.
There are a couple of reasons for this. As the
industry matures, more and more people are pure hardware or pure software
engineers. The birth of the embedded industry in the 70s was driven by hardware
people who counted T-states and hand-crafted assembly code to get things done
on impossibly small CPUs. Systems were small, so they could tolerate the impossible
spaghetti code that resulted. Now it seems the hardware folks heave a design
over the moat to the firmware crowd, who struggle with too little knowledge of
how the system works, and sometimes even less knowledge of the tradeoffs that
could have been made to tune system performance.
The PC industry exacerbates the problem. The pursuit
of desktop compute horsepower spills over into embedded systems. When everything
from PC Week to Time Magazine to Playboy touts the virtues of 64 bit processing
with zigabytes of cache and infinite DRAM, when we're all used to having (in
effect) a mainframe on our desks, it's hard to get worked up about performance
issues.
FBI agents follow the money to understand the crime -
not a bad metaphor for us. Consider the dynamics of the processor market:
billions of 8 bit CPUs make it into small systems every year. Tens of millions
of big chips ship in PCs. Those 8 bitters are selling for a pittance - many for
under a buck. Massive processors go for a huge sum, with very high profit
margins. Clearly the chip vendors have an interest in spreading the gospel of
solving performance problems by tossing in more CPU horsepower.
A conspiracy? Nah - just good business sense.
Don't get me wrong: by all means use as much CPU as
you need. Systems are indeed much more complex than two decades ago, and
sometimes require a massive dose of horsepower. Do, though, use
appropriate technology. Use a rational decision process to select
just enough CPU for your system. Use a software design that complements the
hardware, with optimized critical routines, designed for speed from the outset.
Software First
In a perfect world you'd design the code first, and
select the processor after understanding the code's constraints. Perhaps this
happens on the planets circling Beta Pictoris. On Earth, though,
the CPU choice is usually made based on a vague feeling of address
space/performance requirements, tempered by inventory issues ("we use the part
in all of our other products, so will use it again"), price ("the product must
cost less than twelve bucks"), and expertise ("our team already understands
the 8080, so we'll keep using it").
Though we may justifiably complain about this
injustice done by management to our design, in fact it's rare we understand
the performance issues enough to make an informed processor decision. Just how
fast will those 20,000 lines of C run on an 8051, or on a 68040? The compiler
alone abstracts us from the hardware so efficiently that we generally have no
idea if even the simplest "for" statement completes in a microsecond or a
millisecond. Does it call a runtime routine? No one knows, without examining the
generated code.
Except in extreme circumstances, though, wading
through the compiler's assembly listings is a bad idea. The virtue of C - or
any other high level tool - is to insulate us from the details. Once we dive
back into the tool's output we've all but abandoned the advantages the tool
brings.
Now, there's nothing wrong with developing an
understanding of a tool's operation. For time-critical applications it might
make sense to study your compiler. Run some sample code through it and see what
it actually generates. Develop a short document with typical examples, so you know
what to expect before writing thousands of lines of code. Then the compiler
becomes more or less predictable, and you can budget for what it will produce.
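A compiler study file might look something like the sketch below. The function names are mine, purely for illustration; each one isolates a single construct so its generated code is easy to find in the listing. Most compilers can emit an assembly listing (GCC, for example, does so with -S). On a small 8 bit part, expect the 32 bit multiply and the floating point add to turn into calls to runtime library routines, though that depends entirely on your CPU and compiler.

```c
#include <stdint.h>

/* Hypothetical study functions: one construct each, so the matching
   assembly is easy to find in the compiler's listing. */

/* 16 bit integer compare: usually a handful of inline instructions. */
int study_compare(int16_t a, int16_t b)
{
    return a < b;
}

/* 32 bit multiply: on many 8 bit CPUs this becomes a library call. */
uint32_t study_mul32(uint32_t a, uint32_t b)
{
    return a * b;
}

/* Floating point add: without an FPU, expect a runtime routine. */
float study_fadd(float a, float b)
{
    return a + b;
}
```

Compile it once per optimization level you intend to ship with, and note the instruction counts and any library calls next to each function.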
It's a shame the vendors don't do this for us!
Wouldn't it be nice to know that an integer compare eats 20 T-states? Or that
a floating point add burns between 100 and 250 T-states?
On complex processors even reading the assembly
brings little information about raw execution speed, as caching, prefetching,
and pipelining all conspire to radically alter instruction timing based on
events too complex for us to understand. (This is the excuse used by many
compiler vendors for not providing timing info.)
Yet most embedded systems use small 8 and 16 bit
processors, with very limited subsets of these speed-improving complexifiers.
Though a state-of-the-art processor dooms you to tremendous uncertainty, most of
us deal with systems that are quite deterministic.
A simple CPU has very predictable timing. Add a
prefetcher or pipeline and timing gets fuzzier, but still is easy to figure
within 10 or 20 percent. Cache is the wildcard, and as cache size increases
determinism diminishes. Thankfully, today few small embedded CPUs have even the
smallest amount of cache.
Your first weapon in the performance arsenal is
developing an understanding of the target processor. What can it do in one
microsecond? One instruction? Five? Some developers use very, very slow clocks
when not much has to happen - one outfit I know runs the CPU (in a spacecraft)
at 8 kHz until real speed is needed. At 8 kHz they get maybe 1000 instructions
per second. Even small loops become a serious problem. Understanding the physics
- a perhaps fuzzy knowledge of just what the CPU can do at this clock rate -
means the big decisions are easy to make.
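The arithmetic behind that sort of estimate fits in a few lines. This is only a sketch: the function names are mine, and the figure of 8 clocks per instruction is an assumption backed out of the numbers above (8 kHz yielding about 1000 instructions per second), not a spec for any particular CPU.

```c
#include <assert.h>
#include <stdint.h>

/* Rough instruction rate for a clock frequency and an assumed average
   number of clocks per instruction. */
uint32_t instructions_per_second(uint32_t clock_hz, uint32_t clocks_per_instr)
{
    assert(clocks_per_instr != 0);
    return clock_hz / clocks_per_instr;
}

/* Estimated run time, in microseconds, of a loop that makes `iterations`
   passes of `instr_per_pass` instructions each. */
uint64_t loop_time_us(uint32_t clock_hz, uint32_t clocks_per_instr,
                      uint32_t iterations, uint32_t instr_per_pass)
{
    assert(clock_hz != 0);
    uint64_t clocks = (uint64_t)iterations * instr_per_pass * clocks_per_instr;
    return clocks * 1000000u / clock_hz;
}
```

With these assumptions a 100-pass loop of 5 instructions costs 4000 clocks, which is half a second at 8 kHz. That is the kind of fuzzy number that makes the big decisions easy.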
Estimation is one of engineering's most important
tools. Do you think the architect
designing a house does a finite element analysis to figure the size of the
joists? No! He refers to a manual of standards. A 15 foot unsupported span
typically uses joists of a certain size. These estimates, backed up with
practical experience, ensure that a design, while perhaps not optimum, is
adequate.
We do the same in hardware engineering. Signals
travel at roughly half a foot to a foot per nanosecond, depending on the conductor.
It's hard to make high frequency first harmonic crystals, so use a higher
order harmonic. Very small PCB tracks are difficult to manufacture reliably. All
of these are ingredients of the "practice" of the art of hardware design.
None of these are tremendously accurate: you can, after all, create one mil
tracks on a board for a ton of money. The exact parameters are fuzzy, but the
general guidelines are indeed correct.
So too for software engineering. We need to develop a
sense of the art. A 68HC16, at 16 MHz, runs so many instructions per second
(plus or minus). With this particular compiler you can expect (more or less)
this sort of performance under these conditions.
Data, even fuzzy data, lets us bound our decisions,
greatly improving the chances of success. The alternative is to spend months and
years generating a mathematically precise solution - which we won't do - or to
burn incense and pray - the usual approach.
Models
Another sense of the art we must cultivate is a range
of well understood canned solutions. On comp.arch.embedded recently quite a few
posters responded to a question about fast 8 and 16 bit CRC algorithms. I pulled
down several of the more interesting ones and stuck them in my library.
Why re-invent the wheel every time? These algorithms may be big (in some
cases), but are fast and predictable. Reading the code I get a sense of their
speed. The first time I fire one up I'll get a quantifiable measure, one that
I'll associate with that algorithm forever. Change CPUs, change clock rates,
and the measures shift, but by more or less predictable amounts.
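To give a flavor of the big-but-fast tradeoff, here is a minimal sketch of a table-driven 16 bit CRC. It is not one of the routines from the newsgroup; I've assumed the common CRC-16/CCITT-FALSE parameters (polynomial 0x1021, initial value 0xFFFF), and the function names are my own. The 512 byte table is the "big" part. In exchange, each data byte costs one lookup, a shift and two XORs, regardless of the data.

```c
#include <stddef.h>
#include <stdint.h>

/* 256-entry lookup table: 512 bytes traded for speed. */
uint16_t crc_table[256];

/* Build the table once at startup (or precompute it and place it in ROM). */
void crc16_init_table(void)
{
    for (unsigned i = 0; i < 256; i++) {
        uint16_t crc = (uint16_t)(i << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000u) {
                crc = (uint16_t)((crc << 1) ^ 0x1021u);
            } else {
                crc = (uint16_t)(crc << 1);
            }
        }
        crc_table[i] = crc;
    }
}

/* Per byte: one table lookup, one shift, two XORs. */
uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFFu;
    while (len--) {
        crc = (uint16_t)((crc << 8) ^ crc_table[((crc >> 8) ^ *data++) & 0xFFu]);
    }
    return crc;
}
```

The standard check string "123456789" should come out as 0x29B1 with these parameters, which makes a handy sanity test when you move the routine to a new CPU.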
Stealing algorithms is useful but admittedly deals
with only a portion of an entire new system. You can plug the routines into your
design, but there's always a ton of new code whose behavior may not be well
understood.
Experiment. Run portions of the code. Use a stopwatch
- metaphorical or otherwise - to see how it executes. Buy a performance analyzer
or simply instrument sections of the firmware (see my June article for ideas
about measuring ISR performance with a scope) to understand the code's
performance.
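Where a scope isn't handy, a free-running hardware timer can serve as the stopwatch. The sketch below is one way to do it, not the method from the June article. The probe names are hypothetical, and sim_timer() is a host-side stand-in for whatever timer register your part provides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hook that returns a free-running 16 bit timer count. */
typedef uint16_t (*read_timer_fn)(void);

typedef struct {
    read_timer_fn read_timer;
    uint16_t start;    /* count captured at probe_start() */
    uint16_t last;     /* ticks taken by the most recent run */
    uint16_t worst;    /* longest run seen so far */
    uint32_t samples;  /* number of completed measurements */
} perf_probe;

void probe_init(perf_probe *p, read_timer_fn fn)
{
    assert(p != NULL && fn != NULL);
    p->read_timer = fn;
    p->start = p->last = p->worst = 0;
    p->samples = 0;
}

void probe_start(perf_probe *p)
{
    p->start = p->read_timer();
}

void probe_stop(perf_probe *p)
{
    /* Unsigned 16 bit subtraction stays correct across one timer rollover. */
    p->last = (uint16_t)(p->read_timer() - p->start);
    if (p->last > p->worst) {
        p->worst = p->last;
    }
    p->samples++;
}

/* Host-side stand-in for the hardware timer, so the sketch runs on a PC. */
uint16_t sim_ticks;
uint16_t sim_timer(void)
{
    return sim_ticks;
}
```

Wrap the routine of interest in probe_start() and probe_stop(), let the system run for a while, then read out the worst case. Remember that the probe itself eats a few cycles, and that a routine longer than one full timer period will fool it.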
The first time you do this you'll think "this is
so cool", and you'll walk away with a clear number: xxx microseconds for
this routine. With time you'll develop a sense of speed. "You know, integer
compares are pretty damn fast on this system." Later - as you develop a sense
of the art - you'll be able to bound things. "Nah, there's no way that
loop can complete in 50 microseconds."
This is called experience, something that we all too
often acquire haphazardly. We plan our financial future, we work daily with our
kids on their homework, even remember to service the lawnmower at the beginning
of the season, yet neglect to proactively improve our abilities at work.
Experience comes from exposure to problems and from
learning from them. A fast, useful sort of performance expertise comes from
extrapolating from a current product to the next. Most of us work for a company
that generally sells a series of
similar products. When it's time to design a new one we draw from the
experience of the last, and from the code and design base. Building version 2.0
of a widget? Surely you'll use algorithms and ideas from 1.0. Use 1.0 as a
testbed. Gather performance data by instrumenting the code.
Always close the feedback loop! When any project is
complete, spend a day learning about what you did. Measure the performance of
the system to see just how accurate your processor utilization estimates were.
The results are always interesting and sometimes terrifying. If, as is often the
case, the numbers bear little resemblance to the original goals, then figure out
what happened, and use this information to improve your estimating ability.
Without feedback, you work forever in the dark. Strive to learn from your
successes as well as your failures.
Track your system's performance all during the
project's development, so you're not presented with a disaster two weeks before
the scheduled delivery. It's not a bad idea to assign CPU utilization
specifications to major routines during overall design, and then track these
targets like you do the schedule. Avoid surprises with careful planning.
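One lightweight way to track those targets is a budget table that gets checked every time you take new measurements. The sketch below assumes percentages of total CPU time; the structure and function names are hypothetical, and the routine names you put in the table are your own.

```c
#include <stddef.h>

/* One row per major routine: the share of CPU time allotted during design
   and the share most recently measured on the running system. */
typedef struct {
    const char *name;
    unsigned budget_pct;
    unsigned measured_pct;
} cpu_budget;

/* Total measured CPU utilization across all tracked routines. */
unsigned total_measured_pct(const cpu_budget *table, size_t n)
{
    unsigned total = 0;
    for (size_t i = 0; i < n; i++) {
        total += table[i].measured_pct;
    }
    return total;
}

/* Number of routines currently exceeding their design budget. */
size_t count_over_budget(const cpu_budget *table, size_t n)
{
    size_t over = 0;
    for (size_t i = 0; i < n; i++) {
        if (table[i].measured_pct > table[i].budget_pct) {
            over++;
        }
    }
    return over;
}
```

A nonzero count from count_over_budget() early in the project is a planning problem. The same count two weeks before delivery is a disaster.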
A lot of projects eventually get into trouble by
overloading the processor. This is always discovered late in the development,
during debugging or final integration, when the cost of correcting the problem
is at the maximum. Then a mad scramble to remove machine cycles begins.
We all know the old adage that 90% of the processor
burden lies in 10% of the code. It's important to find and optimize that 10%,
not some other section that will have little impact on the system's overall
performance. Nothing is worse than spending a week optimizing the wrong routine!
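The arithmetic shows why picking the right routine matters so much. Here's a small sketch (the function name is mine) that computes the CPU load left over after you speed up code responsible for a given share of that load.

```c
#include <assert.h>

/* Remaining CPU load, in tenths of a percent of the original load, after
   code that accounts for share_pct percent of it runs `speedup` times faster. */
unsigned load_after_tenths_pct(unsigned share_pct, unsigned speedup)
{
    assert(share_pct <= 100u && speedup != 0u);
    return (100u - share_pct) * 10u + (share_pct * 10u) / speedup;
}
```

Double the speed of the code carrying 90% of the load and the total drops to 55.0% of what it was. Spend the same week doubling the speed of the code carrying the other 10% and you're still at 95.0%.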
If you understand the design, if you have a sense of
the CPU, you'll know where that 10% of the code is before you write a line.
Knowledge is power.
Learn about your hardware. Pure software types often
have no idea that the CPU is actively working against them. I talked to an
engineer lately who was moaning about how slow his new 386EX-based instrument
runs. He didn't know that the 386EX starts up with 31 wait states, and so had
never reprogrammed it to a saner value.
Conclusion
In the 1970s Motorola sold a one bit CPU. Given
enough time and address space it could do pretty much anything our embedded
processors do today. Its performance was adequate for many applications,
especially when coupled with appropriate
software.
If there were a procedure, a checklist, that we could
follow that ensured fast enough code, I'd recommend we all slavishly follow it.
There ain't, and it appears no such silver bullet will appear any time soon.
Our only option is to design carefully, to measure system speed often, and to
develop a sense of the art.