Perform or Perish
How do you deal with software performance problems?
Published in Embedded Systems Programming, March 1991
 |
For hints, tricks and ideas about better ways to build embedded systems, subscribe to The Embedded Muse, a free biweekly e-newsletter. No hype, just down to earth embedded talk. 23,000 other engineers subscribe. It takes just a few seconds (all we need is your email address, which is shared with absolutely no one) to subscribe to the Embedded Muse. |
What processor should you use in your next product? High tech
marketers push dozens of common CPU families, some with yet more
dozens of variants within the family. Do you need 8, 16, or 32
bits? How big of an address space? How much on-board I/O? It's
hard enough to pick a processor for a simple controller when cost
is the only concern. What about more complex applications that
make substantially more demands on the computer's raw computational
capability?
Processor performance seems to be the Holy Grail of our business.
The silicon wizards give us ever more MIPs for less and less money.
But just how much is enough for the project at hand? Who can afford
to select a chip, design hardware and tens of thousands of lines
of code, and then discover that the system just can't keep up
with real time events?
I don't know how to prove that a particular processor will be
adequate for a non-trivial task. It's awfully hard to equate a
less-than-adequate design specification with a MIP rating. This
is even tougher when programming in a high level language, or
when using a canned real time operating system.
In real life it seems that the hardware designers often pick the
CPU based on cost and peripheral compatability, and then tell
the programmers to "make it work". Somehow. Or, when
software folks get involved in the decision, the choice sometimes
winds up being made purely on the basis of experience. "The
6501 is in all of our products; surely we'll have no trouble with
it in this one".
In fact, a lot of projects eventually get into trouble by overloading
the processor. This is always discovered late in the development,
during debugging or final integration, when the cost of correcting
the problem is at the maximum. Then a mad scramble to remove machine
cycles begins.
We all know the old adage that 90% of the processor burden lies
in 10% of the code. It's important to find and optimize that 10%,
not some other section that will have little impact on the system's
overall performance. Nothing is worse than spending a week optimizing
the wrong routine!
Fortunately, performance analysis tools and techniques abound.
These range from a head-scratching session to $20k+ special purpose
instruments. Do use the right tools (whether brainpower or technology)
to find speed problems, and then optimize only where absolutely
necessary. Otherwise, you'll be like a blind man hunting for a
contact lens.
Performance Tools
Dozens of vendors stand ready to supply you with hardware-based
performance analysis tools. Some come as add-ons to emulators;
others are stand-alone products dedicated to the sole function
of identifying performance bottlenecks. All offer easy, real time
capture of your program's activity. The only (!) drawback is cost,
which ranges from a few thousands dollars for an ICE add-on to
over $20k for a dedicated instrument.
Hardware performance analyzers come in two very different flavors.
Statistical units sample the target system's address lines every
so often, logging the address found at each sample. Presumably,
after some number of iterations a representative picture of the
target's operation will emerge. Non-statistical instruments capture
every single machine cycle issued by your program. These are more
expensive than their statistical counterparts, but give an accurate
execution profile of even single-shot events (like an infrequently-invoked
subroutine).
Any of these performance analyzers are particularly good at prying
unexpected information from your program. For example, many 8
bit C compilers generate vast quantities of code to handle automatic
variables, since the stack capabilities of the CPUs leave lots
to be desired. A hardware performance analyzer will instantly
show up this overzealous stack use as a peak in processor use.
Of course, not everyone can afford the cost of such a special
purpose instrument; it's surprising just how many programmers
still make do with primitive "burn and hope" strategies,
eschewing all development aids. Since most people spend relatively
little time solving performance problems, it often makes sense
to use cheaper, more readily available tools.
Several companies promote software based performance analysis.
Paradigm's Inside and Borland's Turbo Profiler are possibly the
best known. Both operate by instrumenting compiled code to track
its execution.
Software based performance analyzers are a great compromise between
price and capability. They offer high accuracy and simple operation
at an absurdly low cost. Unfortunately, the products mentioned
operate only on 80x88 series processors, and both expect DOS services
support. This does limit their usefulness. Fortunately more and
more embedded systems use Intel-series CPUs or their NEC near
equivalents. The wide availability of inexpensive high quality
compilers and assemblers are certainly a compelling argument in
their favor.
You can often take advantage of software performance analyzers
even if your embedded environment does not have DOS resources.
Most of the code of any system is tied up in mathematical algorithms,
screen handlers, and the like. It makes sense to prototype and
measure the performance of all of these under MS-DOS before switching
to the less stable development hardware.
Any software performance analyzer will interfere to some extent
with the code's speed. It's a little like Heisenberg's Uncertainty
Principle, except that instead of not knowing both the mass and
momentum of a particle, the measurement of routine speed skews
the system's real time behavior. Interrupt response, in particular,
will be degraded. Still, you can generally measure enough of a
system's response to get a pretty good idea where the biggest
problems are.
Tricks and Techniques
Part of the fun of embedded programming is creating clever windows
into a system's operation. It's awfully hard to extract any sort
of information from most embedded systems; performance data is
especially challenging.
Even if you have access to the latest and most expensive marvel
of performance measurement it makes sense to understand and use
other, perhaps less sophisticated, ways to get the same or similar
data. Always strive to expand your arsenal of tools!
Pure software types always think I'm a little daft whenever I
say this, but the oscilloscope, colloquially known as the scope,
is one of the most useful debugging tools. After all, in the embedded
business our code ultimately just pushes bits out I/O ports; scopes
are a natural choice for measuring the timing of these signals.
A scope is the ideal low-cost performance measurement tool.
Any decent scope shows the state of two independent signals simultaneously,
as a graph of amplitude (i.e., voltage) versus time. Software
people don't really care about the vertical axis calibration since
any signal, regardless of voltage, is (for our purposes) as good
as any other. The horizontal axis shows time, the crucial component
needed in effective performance measurements.
Suppose you wrote an interrupt service routine in C. If more than
a handful of lines, you'll probably have no idea how long it runs.
Yet, an ISR that is too slow will never keep up with real time
events. Modify the routine to set a spare parallel bit high when
invoked, and drive it low just before exiting. Then, connect a
scope to that bit and measure the duration of the pulse. With
only a few seconds work you'll know just how long the ISR executes.
Also measure the time between pulses. This is the time available
for all of the other code. If the ISR runs for 50 out of every
100 milliseconds, then it uses 50% of all CPU resources, a rather
scary amount.
If, in this case, you happen to know the interrupt occurs every
25 milliseconds, then this simple measurement immediately indicates
that the ISR takes so long to run that the system is missing interrupts.
This is profound, as a missed interrupt usually spells disaster
in real time systems. A scope gives the information in a heartbeat;
few other tools will.
You can time any routine this way. Of course, if (gasp!) your
routines use multiple entry and return points then you'll have
to add a lot more bit setting and clearing statements. This is
yet one more argument in favor of a single return at the end of
each routine.
A variation on this theme is using the scope to measure "delta
time" - the period between two events. Suppose you are concerned
about how much time elapses between an interrupt being asserted
and some other action happening. For example, if the interrupt
signals the receipt of a "conversion complete" signal
from and A/D converter, how long does it take the code to read
data from the A/D? If the analog input varies continuously, fast
A/D reads are probably crucial. Put one scope channel on the interrupt
signal and the other on the A/D chip select pin. Read the difference
in time between the two scope traces.
These techniques measure execution times of individual routines,
or the latency of a single interrupt. With only two channels and
no memory, the scope doesn't identify the time spent in individual
lines of C within a routine. Be clever! Once again, add code to
drive a bit high only during the code fragment's execution. Connect
one scope channel to the bit, and use the other to probe addresses
generated by the CPU. Look at one address bit at a time, making
a paper record of when addresses change relative to the routine's
start pulse. Given a short subroutine, it's not too hard to make
a map of its execution.
Logic analyzers are almost as useful for performance measurements.
Suppose the system runs a "main loop" in 100 milleseconds
or so. Connect the analyzer to the system's address bus and command
it to acquire one sample every 100 microseconds. 1000 acquisitions
after starting, the logic analyzer's internal memory will contain
a sort of statistical profile of where the code runs. It might
miss very short routines, but it will give you a high level picture
very quickly.
You can get a more refined look at the operation of any section
of code by telling the analyzer to "trigger", or start
collecting, when a certain address is sensed. Crank up the acquisition
rate for finer details. If the analyzer can store 1000 cycles,
then with a 10 microsecond rate you'll get a pretty good profile
over 10 milliseconds of execution. In the example of monitoring
an interrupt service routine, trigger collection on the ISR's
start address.
If a scope or logic analyzer isn't your cup of tea, consider connecting
a PC to your system via a fast serial or parallel port. You can
get some crude idea of performance levels by having every routine
transmit a unique code when invoked. Then, write a program to
sift through the data and extract performance parameters. Real
time response might be sacrificed, but again, you'll get an idea
of what the system is doing.
What Then?
Usually performance problems fall into one of three categories:
excessive compiler overhead, crummy code, or slow algorithms.
Modern C compilers generate high quality code. Most overhead problems
come from weaknesses in the processor's architecture, like poor
stack handling abilities. Recently I saw this graphically when
comparing real time trace data on a Z80 to the original C source:
++I (with I a 16 bit signed integer) generated about 25 instructions.
Changing I from an automatic to a static variable cut out 75%
of the overhead.
Crummy code problems come mostly from programming without thinking.
Don't force the compiler to optimize your program for you. Fold
constants yourself, remove unneeded code from loops, and minimize
conversions between types. This is just good programming style.
Keep algorithms and routines to a single page, so you can understand
what is really going on.
Given reasonable code, don't expect huge performance improvements
from tuning a line here or there. Consider replacing a slow algorithm
with one better suited to the application. For example, never
use a Taylor series to compute transcendental functions. Many
reference books (such as Computer Approximations by Hart) contain
much faster converging series expansions. Or, when precision is
not a problem but speed is, do a fast table lookup.
Close the Loop
How do you know if a particular processor will be adequate for
a project? Many of us work for companies that provide a standard
product. New versions often use the old code, or are based on
algorithms and techniques developed on a current product. Consider
instrumenting the current code to measure exactly how much free
processor time exists. Use a real performance analyzer or one
of the methods I've described.
Closing the feedback loop! When any project is complete, spent
a day learning about what you did. Measure the performance of
the system to see just how accurate your processor utilization
estimates were. The results are always interesting and sometimes
terrifying. If, as is often the case, the numbers bear little
resemblance to the original goals, then figure out what happened,
and use this information to improve your estimating ability. Without
feedback, you work forever in the dark. Strive to learn from your
successes as well as your failures.
Track your system's performance all during the project's development,
so you're not presented with a disaster two weeks before the scheduled
delivery. It's not a bad idea to assign CPU utilization specifications
to major routines during overall design, and then track these
targets like you do the schedule. Avoid surprises with careful
planning.
|