Dangling pointers kill - but they are everywhere. Originally in Embedded Systems
Programming, November, 1999.
By Jack Ganssle
It's rather interesting - and to me distressing - to look over the history of embedded debug tools and realize that most of the features we use today were introduced by Intel in their very first emulator back in the mid-70s. After a quarter century our products have rocketed from 2 MHz eight bit CPUs to 10 gigawowiehertz 32 bit computing monsters, yet sometimes our tools seem trapped in a time warp. We're using breakpoints, real time trace, emulation RAM, and not a lot more.
During my time in the in-circuit emulator business, in the pursuit of product differentiation we were constantly on the prowl for solutions to common debugging problems. Our dream was the ultimate debugging interface: two buttons, one labeled "find bug" and the other "fix bug". A lot of generally unsuccessful time spent looking for any sort of improvement to the tools left me feeling that debugging is still an art, something generations away from being codified into science.
Every programming text urges us to use the very best defensive strategies in our code, yet, whether due to panic-inducing schedules or just poor self-management skills, few developers ever think beyond today's coding to tomorrow's inescapable debugging.
In 15 years as a tool vendor I metaphorically looked over the shoulders of thousands of embedded developers and saw virtually all just cranking code as fast as possible, with nary a thought about just how they'd make their creations actually work. Been there, done that myself. We're an optimistic bunch, usually convinced that this time things will require minimal debugging. The ugly truth is that debugging eats something like 50% of project development time. An uglier truth is that reactionary debugging - responding to bugs - guarantees poor quality firmware. Too many bugs won't exhibit problems immediately; many lie latent for years till the code is unusually stressed.
C compilers should come with a warning label: "Danger! Use of this product may lead to memory leaks, corrupt data, and erratic crashes." C brings joys and perils; we must deal with both.
It's intriguing how two major languages from the late 60s went in quite different directions. ADA proposed the idea of creating correct code from the outset; an old saw says that if an ADA program compiles, it will work. A very picky compiler largely keeps developers from writing code that fails due to stupid problems. C, on the other hand, fits the 60s "anything goes" image: free love, free music, and the freedom to write incomprehensible programs that fail in mysterious ways.
Anyone who has worked in C suffers from pointer-wielding scars. On one hand pointers helped make C the embedded language of choice. They give us much of the power of assembly, while packaging the capabilities inside a HLL. On the other hand pointers almost always lead to travail. Novices misuse them, mixing up referencing and dereferencing, adding not enough or too many asterisk-prefixes. Professionals bypass such simple errors, but still create code that overruns buffers or writes over the code itself.
Though pointers create agony in pretty much every system ever produced, most of us react to each problem with surprise. Given that pointer issues are so common, why don't we build our systems from the outset in a way to trap these inevitable problems? Why don't we buy tools that track and flag these problems automatically?
A couple of companies do sell pointer and memory checking tools, though they are mostly aimed at the desktop applications market. Geodesic claims that 99% of all applications ship with significant memory and pointer problems. How this number translates into the embedded market no one knows. Even if off by an order of magnitude, it's still pretty scary.
Parasoft (www.parasoft.com) and Nu-Mega (www.numega.com) both have tools aimed at the desktop and Windows CE market, about as close to the embedded industry as any commercial product I've found. Nu-Mega's BoundsChecker has surely been an industry staple for a very long time.
It's a shame all of this great technology hasn't been ported to the mainstream embedded world. That task has been left to a few charitable souls who wrote decent tools which they put into the public domain. Walter Bright, author of a popular C compiler, made his mem.c routines available to us (at www.snippets.org). This package detects most common problems associated with memory allocations.
Mem.c tracks obvious problems like frees without corresponding mallocs. More interestingly, it picks up many sorts of pointer problems by allocating a bit of storage before and after each malloc'ed block, and then filling these extra areas with signatures. After a free, mem.c checks to ensure the signature is intact; if not, a pointer over- or under-ran the buffer.
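The signature trick is easy to sketch. What follows is not Bright's code, just a minimal illustration of the idea under some assumptions: the names guarded_malloc() and guarded_free(), the 16-byte guard size, and the 0xA5 fill value are all invented here, and the layout assumes the target needs no more than 16-byte alignment.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* All names and constants below are hypothetical, chosen for illustration. */
#define GUARD_SIZE 16u     /* signature bytes on each side of the block    */
#define GUARD_BYTE 0xA5u   /* signature value                              */

enum { GUARD_OK = 0, GUARD_UNDERRUN = 1, GUARD_OVERRUN = 2 };

/* Layout of each allocation:
 * [size, padded to GUARD_SIZE][front guard][user data][rear guard]        */
void *guarded_malloc(size_t size)
{
    if (size > SIZE_MAX - 3u * GUARD_SIZE)
        return NULL;                        /* request would overflow      */
    unsigned char *raw = malloc(size + 3u * GUARD_SIZE);
    if (raw == NULL)
        return NULL;
    memcpy(raw, &size, sizeof size);        /* remember the user length    */
    memset(raw + GUARD_SIZE, GUARD_BYTE, GUARD_SIZE);              /* front */
    memset(raw + 2u * GUARD_SIZE + size, GUARD_BYTE, GUARD_SIZE);  /* rear  */
    return raw + 2u * GUARD_SIZE;
}

/* Returns 1 if every byte of the guard still holds the signature. */
static int guard_intact(const unsigned char *guard)
{
    for (size_t i = 0; i < GUARD_SIZE; i++)
        if (guard[i] != GUARD_BYTE)
            return 0;
    return 1;
}

/* Frees the block and reports GUARD_OK, or a combination of
 * GUARD_UNDERRUN and GUARD_OVERRUN if a signature was damaged.            */
int guarded_free(void *ptr)
{
    if (ptr == NULL)
        return GUARD_OK;
    unsigned char *user = ptr;
    unsigned char *raw = user - 2u * GUARD_SIZE;
    size_t size;
    memcpy(&size, raw, sizeof size);
    int result = GUARD_OK;
    if (!guard_intact(raw + GUARD_SIZE))
        result |= GUARD_UNDERRUN;
    if (!guard_intact(user + size))
        result |= GUARD_OVERRUN;
    free(raw);
    return result;
}
```

In a debug build one might trap or log whenever guarded_free() returns anything other than GUARD_OK. Note that the check only fires at free time, so a block that is never freed is never checked.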
Jeff Dunlop's memory checking package, also at www.snippets.org, offers more checks, including some more appropriate for embedded systems that may not use the malloc() function call. Malloc(), of course, often leads to memory fragmentation in systems that run for months or years. A desktop application might tolerate fragmentation, since the user probably exits the program from time to time, and ultimately (at least in the Windows environment) expects a certain number of system crashes. Though Dunlop's package includes tests for malloc'd blocks, it also supports arrays and statics.
Both Bright's and Dunlop's code replace standard library functions, so they must be linked into your code during development. The moral to this is that if you link the code in from the very beginning of product development, errors will pop up as your code does unreasonable things. Don't link it in, and odd crashes will leave you puzzled. Even if you suspect a memory problem you probably won't relink to include the diagnostic routines, believing (as we all do) that a few more minutes' work will turn up the source of the problem. This is sort of like avoiding make files, since creating the make might eat up 20 or 30 minutes and we just know we'll only need to build the code a few times.
One commercial product aimed squarely at embedded memory troubles is CodeTest from Applied Microsystems (www.amc.com). It's an external hardware tool that relies on instrumented code to track what gets allocated and freed when.
There's a performance penalty, of course, associated with using any of these packages. If your code must run so fast that no speed degradation is possible, then these tools are not for you. Remember the rule of thumb: a 90% loaded system doubles development time; at 95% loading the schedule is three times longer than for a lightly loaded system. When performance issues are so severe reasonable tools fail, then it's time to reconsider the design.
Embedded code written in any language seems determined to exit the required program flow and miraculously start running from data space or some other address range a very long way from code store. Sometimes keeping the code executing from ROM addresses feels like herding a flock of sheep, each of whom is determined to head off in its own direction.
In assembly a simple typo can lead to a jump to a data item; C, with support for function pointers, means state machines not perfectly coded might execute all over the CPU's address space. Hardware issues - like interrupt service routines with improperly initialized vectors and controllers - also lead to sudden and bizarre changes in program context.
Over the course of a few years I checked a couple of dozen embedded systems sent into my lab. The logic analyzer showed writes to ROM (surely an exercise in futility and a symptom of a bug) in more than half of the products.
Though there's no sharp distinction between wandering code and wandering pointers (as both often come from the same sorts of problems), diagnosing the problems requires different strategies and tools.
Quite a few companies sell products designed to find wandering code, or that can easily be adapted to this use. Some emulators, for instance, let you set up rules for the CPU's address space: a region might be enabled as execute-only, another for data read-writes but no executions, and a third tagged as no accesses allowed. When the code violates a rule the emulator stops, immediately signaling a serious bug. If your emulator includes this sort of feature, use it!
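The rule table behind such a monitor is simple. Here is a rough sketch in C, not any vendor's implementation; the memory map, the type names, and access_is_legal() are all made up for illustration. Something like this could also post-process a captured bus trace or run inside a simulator.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { ACCESS_FETCH, ACCESS_READ, ACCESS_WRITE } access_type;

enum { ALLOW_FETCH = 1, ALLOW_READ = 2, ALLOW_WRITE = 4 };

typedef struct {
    uint32_t start;     /* first address of the region                 */
    uint32_t end;       /* last address of the region (inclusive)      */
    unsigned allowed;   /* OR of the ALLOW_ flags permitted here       */
} region_rule;

/* Hypothetical memory map: addresses are invented for this example. */
static const region_rule memory_map[] = {
    { 0x00000000u, 0x0000FFFFu, ALLOW_FETCH },              /* code: execute-only */
    { 0x00010000u, 0x00017FFFu, ALLOW_READ | ALLOW_WRITE }, /* RAM: no executes   */
    { 0x00018000u, 0xFFFFFFFFu, 0 },                        /* unused: no access  */
};

/* Returns nonzero if the access obeys the rules, zero on a violation.
 * Addresses covered by no rule at all are treated as violations.       */
int access_is_legal(uint32_t addr, access_type type)
{
    unsigned need = (type == ACCESS_FETCH) ? ALLOW_FETCH
                  : (type == ACCESS_READ)  ? ALLOW_READ
                  :                          ALLOW_WRITE;

    for (size_t i = 0; i < sizeof memory_map / sizeof memory_map[0]; i++) {
        if (addr >= memory_map[i].start && addr <= memory_map[i].end)
            return (memory_map[i].allowed & need) != 0;
    }
    return 0;
}
```

A real system with constants stored in ROM would need ALLOW_READ on the code region too; the execute-only rule above simply mirrors the example in the text.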
One of the most frustrating parts of being a tool vendor is that most developers use 10% of a tool's capability. We see engineers fighting difficult problems for hours, when a simple built-in feature might turn up the problem in seconds. I found that less than 1% of people I've worked with use these execution monitors, yet probably 100% run into crashes stemming from code flaws that the tools would pick up instantly.
Developers fall into four camps when using an execution monitoring device: the first bunch don't have the tool. Another group has one but never uses it, perhaps because they have simply not learned its fundamentals. To have unused debugging power seems a great pity to me. A third segment sets up and arms the monitoring tool only when it's obvious the code indeed wanders off somewhere, somehow.
The fourth, and sadly tiny, group builds a configuration file loaded by their ICE or debugger on every startup, that profiles what memory is where. These, in my mind, are the professional developers, the ones who prepare for disaster long before it inevitably strikes. Just like with make files, building configuration files takes tens of minutes so is too often neglected.
If your debugger or ICE doesn't come with this sort of feature, then adapt something else! A simple trick is to monitor the address bus with a logic analyzer programmed to look for illegal memory references. Set it to trigger on accesses to unused memory (most embedded systems use far less than the entire CPU address space; any access to an unused area indicates something is terribly wrong), or data-area executes, etc.
I've had great success doing this with HP's MSO, a sort of combined logic analyzer and scope. Since the scope half of the instrument gets the bulk of the use, I'll leave the analyzer set up as a poor man's monitor.
If a logic analyzer is too rich for your budget, check out the $1295 PodAlyzer (http://www.associatedpro.com/aps/pod-8020.htm), a device the size of a roll of stamps that connects to a PC's serial port. With 18 channels it's ideal for monitoring 8 and 16 bit systems.
Part of the downside of using any logic analyzer is that it takes too long to connect all of those annoying probes to a typical CPU's whisker thin SMT leads. After using these devices for more than a generation I've gotten neither faster at connecting the leads, nor more accurate at getting them right, than when I first started. The best solution is to build a logic analyzer connector onto the prototype target system. Without it, you'll resist using this very effective software-diagnosis tool. Add the connector and you'll use the analyzer constantly and effectively.
Note that some emulator vendors, frustrated with the difficulty of connecting to SMT processors, now suggest users install a special emulator connector on target boards (see www.hitex.com for one company's clever approach). Even those of us using nothing more than an analyzer should emulate this example.
Some ICEs include code coverage, a feature that tells you if every line of code executed. One study indicated that fully 50% of the code in embedded systems is never tested (after all, error handling, deep IF conditions, complex switch statements all lead to special cases most QA programs can't manage). Code coverage instruments ensure that test cases do check each possible condition.
Hardware-only code coverage tools watch the address bus and log each instruction fetch. These will generally log executed addresses with no corresponding code, which typically indicates wandering code.
Beyond the hardware approaches, write the application defensively. For instance, fill your unused interrupt vectors with pointers to a debug routine. Configure the tools to set a breakpoint on that routine automatically every time you load the debugger. A bad vector will show up immediately, not after the processor executes a million instructions from your data area.
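For a processor whose vector table lives in RAM, the fill can be done at startup. The sketch below assumes exactly that; the table size, the handler names, and fill_unused_vectors() are hypothetical. With a ROM-based table the same effect comes from listing the debug routine in every unused slot of the startup file.

```c
#include <stddef.h>

typedef void (*isr_handler)(void);

#define VECTOR_COUNT 16u   /* hypothetical table size */

/* Counts stray interrupts so the evidence survives even without a debugger. */
volatile unsigned unexpected_interrupts;

/* The debug routine: set the breakpoint here. */
void unexpected_isr(void)
{
    unexpected_interrupts++;
}

/* Stand-in for a real handler the application installs. */
void timer_isr(void)
{
}

/* Point every empty (NULL) slot at the debug routine,
 * leaving the handlers already installed untouched.    */
void fill_unused_vectors(isr_handler table[], size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (table[i] == NULL)
            table[i] = unexpected_isr;
    }
}
```

Call fill_unused_vectors() after installing the real handlers and before enabling interrupts.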
Seed unused memory with illegal instructions. Few apps use every last byte of RAM and ROM; instead of leaving these areas set to random values, take advantage of the one-byte call, illegal instruction, or breakpoint instruction that almost every processor supports. On a Z80 it's RST 7; a 68000 has an illegal instruction trap; the 683xx includes a specific breakpoint instruction. If the code wanders into one of these unused regions it will take the exception. You've wisely (hopefully) set a breakpoint on the exception handler, so will find the problem immediately.
If some of the addresses are not tied to memory devices, pull the bus to an illegal instruction with resistors - at least for the prototype - for the same reasons.
Given that you've detected that code wandering merrily outside of the ROM range, what then? If you're using an ICE, logic analyzer, or a trace-enhanced BDM, use real time trace, triggering it to stop collecting when the exception handler starts. Look back a few instructions in the buffer to find the problem. A more limited tool like a ROM monitor still can yield significant clues if you examine the call stack.
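Seeding unused memory can be as simple as a fill pass over the image before it is burned. This sketch uses the Z80 case, where RST 7 is the single byte 0xFF; the function name and the assumption that the used code sits in one piece at the start of the image are mine. Processors whose trap instruction is wider than a byte need an aligned, repeated pattern instead of a plain memset().

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define Z80_RST_7 0xFFu   /* opcode for RST 7: a one-byte call to 0x0038 */

/* Fill everything past the used part of a ROM image with a one-byte
 * trap opcode. Assumes (for this sketch) that code and data occupy
 * image[0 .. used_size-1] and the rest is free.                       */
void seed_unused(uint8_t *image, size_t image_size, size_t used_size,
                 uint8_t trap_opcode)
{
    if (used_size < image_size)
        memset(image + used_size, trap_opcode, image_size - used_size);
}
```

Most linkers and hex-file utilities offer a fill option that does the same job at build time, which is usually the better place for it.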
Solving problems is a high-visibility process; preventing problems is low-visibility. This is illustrated by an old parable:
In ancient China there was a family of healers, one of whom was known throughout the land and employed as a physician to a great lord. The physician was asked which of his family was the most skillful healer. He replied, "I tend to the sick and dying with drastic and dramatic treatments, and on occasion someone is cured and my name gets out among the lords."
"My elder brother cures sickness when it just begins to take root, and his skills are known among the local peasants and neighbors."
"My eldest brother is able to sense the spirit of sickness and eradicate it before it takes form. His name is unknown outside our home."
Great developers recognize that their code will be flawed, so instrument their code, and create toolchains designed to sniff out problems before a symptom even exists.