By Jack Ganssle

Published in Embedded Systems Design, May 2009

Perfect Software

Micrium, the company that sells the very popular uC/OS-II real-time operating system, now has versions of that RTOS for many processors that have either a memory management unit (MMU) or a memory protection unit (MPU).

I'll get to some details about MMUs and MPUs shortly. But first let me paraphrase an interesting conversation I had with Jean Labrosse, Micrium's president, about his philosophy on the use of a memory manager.

Jean feels that the primary reason to use an MMU is to separate multiple applications running on one CPU. If you have, say, a safety-critical controller and an entertainment system, using an MMU means one can certify the controller's code, and not have to re-certify it if the entertainment component gets changed. He complained that some (he was kind enough not to point the finger at me, but he could have) advocate using the MMU to save a system when it crashes. The MMU prevents a rogue task from overwriting any other task's memory space. There's a good chance that when things go wrong, the task will try to wander off to another space and the MMU can trap the problem and initiate recovery.

Jean is an old friend, so I could give him some amiable abuse. "You're just showing your philosophy about software," I railed. "You're totally intolerant of bugs. Your standard is perfection. That's unheard of in this industry."

In IEEE Computer in January, 2004, Jesse Poore wrote: "Theoretically, Software is the only component that can be perfect, and this should always be our starting point." Software is an odd thing. It's not real; it has no form. Everything in the physical world is full of flaws. Bearings wear, people age, bulbs burn out, and plastic deteriorates. But software, if created perfectly, never wears out.

In general, software isn't perfect, of course. It's created by those aging human beings who make mistakes. Programming requires enormous skills. Programs are some of the most complex of all inventions. The field is young and there's much we have yet to learn.

Perfection is penalized. Windows is hardly ideal, yet Microsoft is by any measure the most successful software company of all time. Capitalism doesn't, it seems, reward getting it right. Profits seem to come from "good enough."

Of course, that's not completely true. The avionics on a commercial airliner better be really good or the vendor won't survive the onslaught of lawsuits.

Our customers want the code to be perfect. But they've learned to manage in a world full of electronic gadgets that sometimes behave oddly. Even a member of Garrison Keillor's Professional Organization of English Majors (POEM) knows to cycle power or yank the batteries for ten seconds when the widget freaks out. Our imperfect code has taught the world the meaning of the word "reboot." I well remember when only techies knew what that meant.

While I don't know if all of Micrium's code is perfect, its quality far exceeds the vast majority of programs I've examined. It's all surely PDP: Pretty Darn Perfect. Micrium is not alone; Green Hills achieved EAL6+ certification on a version of their RTOS, so that code, too, must be PDP. Plenty of other embedded applications meet the PDP level as well.

In fact, according to data collected by Capers Jones that he shared with me, software in general has a defect removal efficiency (the percentage of bugs removed prior to shipping) of 87%. Firmware scores a hugely better 94%. We embedded people do an amazingly good job.

But the irony is that software is topsy-turvy. Unlike physical things it is theoretically perfectible, but also unlike physical things a single error can wreak havoc. A mechanical engineer can beef up a beam to add margin; EEs use fuses and heavier wire than needed to deal with unexpected stresses. The beam may bend without breaking; the wire might get hot but continue to function. Get one bit wrong in a program that encompasses 100 million and the system may crash. That's 99.999999% correctness, or a thousand times better than the Five Nines requirement, and about two orders of magnitude better than Motorola's vaunted Six Sigma initiative.
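The arithmetic behind those comparisons can be checked with a few lines of C. This is only a sketch: the function and constant names are made up for illustration, Five Nines is taken as a failure rate of 1 in 100,000, and Six Sigma is taken as the commonly quoted 3.4 defects per million opportunities, which is an assumption since the figure used above isn't stated.

```c
#include <assert.h>

/* Fraction of units that are wrong: defects per opportunity. */
double defect_rate(double defects, double opportunities)
{
    assert(opportunities > 0.0);
    return defects / opportunities;
}

/* How many times lower the achieved rate is than a reference rate. */
double times_better(double reference_rate, double achieved_rate)
{
    assert(achieved_rate > 0.0);
    return reference_rate / achieved_rate;
}

/* Reference failure rates (assumed values, see the text above). */
static const double FIVE_NINES_RATE = 1e-5;   /* 1 - 0.99999 */
static const double SIX_SIGMA_RATE  = 3.4e-6; /* 3.4 per million */
```

Calling defect_rate(1.0, 100e6) gives the one-bad-bit rate of 1e-8. Passing that to times_better() with FIVE_NINES_RATE yields a factor of about a thousand, and with SIX_SIGMA_RATE a factor in the hundreds.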

Bugs destroy projects. They are the biggest cause of slipped schedules. We have all learned to fear the integration phase because of the typically inscrutable bugs that surface. They alienate customers and reputedly have led to bankruptcies.

There are a lot of reasons for bugs. One that's particularly pernicious is our expectation of them; our tolerance for mistakes. Most of us don't believe in perfection, and thus don't use that as the standard. And hence Jean's unwillingness to accept buggy code as a reason to use an MMU.

Companies that strive for PDP code usually achieve it. When that's the expectation, it gets achieved. Though PDP code is not necessarily more expensive than some of the crap that's shipped every year, it does mean the organization as a whole must be striving for perfection. The boss has to embrace the use of the right tools and techniques; his boss has to hold him accountable for getting a PDP product out the door.

The opposite is also true: If the company expects buggy code, it'll get it. Joel Spolsky relates his experience on the Excel team when all that mattered was creating functionality. Developers responded predictably. To implement a function that computed a regression, for instance, they'd write:

float regression(args *in)
{
    return(1);  /* placeholder: the input is ignored */
}

Then it was just a matter of responding to bug reports from the testers.

Whence Perfection?

Though I love Jesse Poore's quote mentioned earlier, perfection is an idealistic dream for most of us. I'm sure Jean Labrosse would argue vehemently, and I admire him greatly for his stand and his success at achieving it. But in this world of million-line programs I have seen no evidence that perfection is about to take the embedded world by storm.

But we should expect PDP. Code that works and that we are utterly confident in when we ship it. Code that has been crafted from the beginning to be correct, using techniques long understood to lead to quality. Standards, inspections, design reviews, decent specs and the use of tools like lint and static analyzers that scrutinize its construction and behavior.

Tests are important: PDP tests that overlook nothing. I suspect the unit tests the Excel team employed were, shall we say, /PDP. The test code needs the same attention to detail as that for the end product. Today, sadly, most test code is considered throwaway garbage not worth a lot of effort.

The statistics are bleak: Typical test routines exercise only half the code. Worse, few of us have any quantitative knowledge of how many tests are needed. Do you run cyclomatic complexity figures on your products? Cheap tools compute it automatically. The complexity gives a minimum number of tests that must be run to ensure that a function is completely tested. We need to use these sorts of metrics to audit our testing. I've written about this here: http://www.embedded.com/columns/technicalinsights/206901032?_requestid=101521.
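To make the complexity-to-tests link concrete, here is a minimal sketch in C. The function, its thresholds and its names are hypothetical, invented only for this illustration. It assumes the usual way of counting the metric for a simple function: the number of decision points plus one.

```c
#include <assert.h>

enum action { IDLE, HEATER_ON, FAN_ON, SHUTDOWN };

/*
 * Hypothetical thermostat logic.
 * Decision points: three (one if, two else-ifs).
 * Cyclomatic complexity: 3 + 1 = 4, so at least four tests are needed.
 */
enum action classify(int temp_c)
{
    enum action result = IDLE;

    if (temp_c > 90) {
        result = SHUTDOWN;
    } else if (temp_c > 30) {
        result = FAN_ON;
    } else if (temp_c < 10) {
        result = HEATER_ON;
    }
    return result;
}

/* One test per independent path; returns the number of failures. */
int run_classify_tests(void)
{
    int failures = 0;

    failures += (classify(95) != SHUTDOWN);   /* first branch taken  */
    failures += (classify(50) != FAN_ON);     /* second branch taken */
    failures += (classify(0)  != HEATER_ON);  /* third branch taken  */
    failures += (classify(20) != IDLE);       /* no branch taken     */
    return failures;
}
```

Four is a floor, not a target. Boundary values such as 90, 30 and 10 deserve their own tests, but a suite with fewer than four cases cannot have exercised every path through this function.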

One of the most brilliant things to come from the agile movement is their focus on automated testing. That can be tough in the embedded world where someone has to press the buttons and watch the displays, and I see few teams in this industry that have excellent automated tests.

Some agilists take me to task; they suggest using mock objects to simulate the hardware. It's a nice and worthwhile idea, but always merely a simulation. Real tests are important and some clever people have created cool external test environments. Several companies I know aim a video camera at the displays and use LabVIEW to interpret the results, closing the test loop using an external computer.

That's pretty cool.

PDP code also often needs built-in surge protectors like those beefy beams and oversized wires. Things do go wrong. Even perfect code and perfect hardware do not necessarily mean a product is PDP. Stuff happens. EMI, EMP (well, hopefully not), brownouts and other effects can disrupt execution. Cosmic rays are increasingly disruptive as device geometries shrink.

It's amazing that computers work at all. A PC moves billions of bits around every second. The voltage difference between a zero and a one keeps shrinking. If any one of those bits is corrupted by a cosmic ray something will go noticeably wrong. We depend on that reliability and expect it. But it's not at all clear to me that expectation is reasonable.

Surge protectors will most often save our systems from an imperfection introduced by us. A self-healing system that captures that one nasty bug that escaped our PDP tests might save the system.

And so, I've long wondered why MPUs are so uncommon in the embedded space. Intel's 386 processor brought memory management to the desktop a quarter century ago; who hasn't used Task Manager to kill off an errant app? The OS and other programs generally run on benignly despite the crash.

An MPU segments the address space into a number of smaller areas, each usually owned by a task or application. The MPU defines access rules: Task A cannot write in this area and can only execute from another region. Essentially each activity operates in its own sandbox. Bad code might foul its own sandbox but can't reach out nastily to the rest of the system.

Hardware in the memory manager watches every transaction and immediately interrupts code that tries to break the rules. Developers can write a handler to shut the task down, restart it, or take some other action to bring the system to a safe and known state.
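The rule check an MPU performs can be modeled in software. What follows is a simplified sketch, not any real MPU's register layout or any RTOS's API: the region table, the permission bits and the function name are all assumptions made for illustration. Real hardware does this comparison on every bus transaction and raises a fault instead of returning false.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical permission bits. */
enum { PERM_READ = 1, PERM_WRITE = 2, PERM_EXEC = 4 };

/* One protected region: an address range, its owner, and allowed accesses. */
struct region {
    uint32_t base;
    uint32_t size;
    int      owner_task;
    unsigned perms;
};

/*
 * Returns true if `task` may perform `access` at `addr`.
 * A false result is where real hardware would interrupt the offending
 * code and hand control to the developer's fault handler.
 */
bool mpu_allows(const struct region *table, size_t count,
                int task, uint32_t addr, unsigned access)
{
    assert(table != NULL);

    for (size_t i = 0; i < count; i++) {
        const struct region *r = &table[i];
        bool in_range = addr >= r->base && (addr - r->base) < r->size;

        if (r->owner_task == task && in_range &&
            (r->perms & access) == access) {
            return true;
        }
    }
    return false;
}
```

With a table that gives task 1 a read/write data region and a read/execute code region, a write by task 1 into its own data passes, a write into its code region fails, and any access by task 2 to either region fails. A handler acting on that failure could then shut the task down or restart it.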

When the 386 came out developers reacted with a storm of protests. Many complained that writing protected mode code was too hard. I haven't heard that complaint in a long time, as the tools improved and now hide the MPU meddling behind a curtain of abstractions. Some RTOSes also provide insulation from the grittiness of handling memory management. Some, like the new version of uC/OS-II, even use the hardware resource to guarantee that activities get a chance to run. Real-time does mean managing time, and so this is an important feature.

MMUs are testosterone-enhanced MPUs. They include the MPU's protection scheme but add virtual memory. Logical memory gets mapped under software control into the physical address space.

In the many years I've been writing about embedded systems the industry has grown and the applications we develop have gone from tiny things, really, to hugely complex, multi-million-line monsters. The good news is that we've gotten better: I really don't think the approaches used decades ago would have scaled to the problems we solve today.

This is an exciting time to be in the field. There is no stasis here, not in the processors and I/O we use, nor in the tools that are available. Static analysis, virtualization, MMU/MPUs and a host of other resources are ours for the buying.

But first, above all, we must approach our projects with an intolerance for bugs. The code might not all be perfect, but PDP is a pretty good goal, too.