Embedded Muse 219 Copyright 2012 TGG January 17, 2012
You may redistribute this newsletter for noncommercial purposes. For commercial use contact email@example.com. To subscribe or unsubscribe go to http://www.ganssle.com/tem-subunsub.html or drop Jack an email at firstname.lastname@example.org.
EDITOR: Jack Ganssle, email@example.com
- Editor's Notes
- Quotes and Thoughts
- Tools and Tips
- On Margins
- Just Reset It
- Joke for the Week
- About The Embedded Muse
Editor's Notes
Lower your bug rates. And get the product out faster.
Sounds too good to be true, but that's the essence of the quality movement of the last few decades. Alas, the firmware community didn't get the memo, and by and large we're still like Detroit in the 1960s.
It IS possible to accurately schedule a project, meet the deadline, and drastically reduce bugs. Learn how at my Better Firmware Faster class, presented at your facility. See http://www.ganssle.com/onsite.htm .
Quotes and Thoughts
In the spirit of my comments about putting margin in the firmware, John Black sent this: Any fool can design a bridge that can stand up. It takes an engineer to design a bridge that can just "barely" stand up.
Tools and Tips
Charles Manning had this comment about scopes: "I started looking for a low cost scope. I first looked at USB scopes but then ended up just buying a Rigol DS1102E. Cheaper than many of the USB scopes and doesn't take up PC screen real-estate. http://www.rigolna.com/products/digital-oscilloscopes/ds1000e/ds1102e/"
Scott Nowell had this intriguing response: "Your statement 'My desktop executes billions of instructions per second, even when sitting more or less idle' caught my attention. Since we don't shut the office machines off at night we have lots of idle computes available. Rather than completely waste that power I have installed World Community Grid applications, http://www.worldcommunitygrid.org/. The application joins you to a grid computing network that uses your spare cycles as a screen saver application to solve important problems that require enormous computer power.
"They are currently running about a dozen projects that work on fighting Malaria, Muscular Dystrophy, and Cancer. There are also projects for clean energy and clean water.
"You can set the amount of disk space it uses, the percentage of cpu power and when it runs.
"This is a great way to turn waste heat into useful activity. Please encourage your readers to join in this project."
I'll present my Better Firmware Faster seminar in Melbourne and Perth, Australia, on February 20 and 26. All are invited. More info here. The early registration discount ends January 20.
On Margins
Bryan Murdock wasn't too keen about my comments about hardware working well: "As a firmware engineer I'm really surprised that you made this statement. Have you never had to provide a firmware workaround for an ASIC bug? You've never been told that changing the hardware at this stage of the game is too expensive, so please come up with a way to fix it in firmware? You've never written code to characterize, calibrate, or compensate for out of spec, out of margin, or just plain faulty hardware components (chosen for their low price and the amount of direct material cost they would save the product)? I find that very hard to believe.
"That being said, I think the rest of your thoughts on design margin and the cost of software are very insightful. The push and pull of optimizing design costs, material costs, and production costs in an engineering project is always fascinating, and I don't think people put enough thought into how the software component of a system affects those equations. Too often people just assume that the software will be relatively low-cost, yet buggy, and that there is no way to change that.
"One interesting contrast with software design is ASIC design. It looks similar to software design (at least, a lot more similar than bridge design) in that you are designing with code that can be simulated to give you feedback on the design very quickly. A big difference from the usual software design process is that the ASIC design is locked at some point and no changes are allowed. A huge focus is then put into design verification before the code is "shipped." ASICs do usually end up with fewer bugs than software (but never zero bugs!). I would argue that locking down the design and focusing solely on verification for a while before shipping is a large reason why. Nobody does that with software. Why? There's a very important and valid reason, and I have already written about that on my blog: http://bryan-murdock.blogspot.com/2009/07/averse-to-change.html
"I've also written about a way that I think we can think about software "material" costs: http://bryan-murdock.blogspot.com/2009/07/simple-software-cost-measurement.html
"That even fits with your thoughts on design margin in software. Yes, we can add more margin to make the code more reliable, but that's more lines of code (which may include tests, they are lines-of-code too), which equals higher cost."
Phil Koopman wrote: "Nice essay. This is a topic I often rant about when teaching embedded systems students.
"I've boiled it down to the following: Bridge builders put in twice as much steel and concrete to make their systems twice as strong as they need to be on paper, which forgives a multitude of sins. How can we make our software twice as strong? Putting in twice as many lines of code is not the answer!
"While this isn't the whole solution, I think it is worth adding robustness to a system. A robust system doesn't fall over and die when encountering an unexpected input at a defined interface. For example, a desktop OS shouldn't crash when it sees a null pointer input on a system call (but I've definitely seen that happen!). Robustness Testing is an efficient way to find these problems, and is a subset of the more generic area of Fuzz Testing. An increasing number of desktop systems use these techniques, usually motivated by security concerns. But embedded systems can benefit from these techniques to improve system stability even if security is not a primary concern."
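To make the robustness idea concrete, here is a minimal sketch in C. Everything in it is an assumption made up for illustration: the set_motor_speed() interface, its error codes, and the 6000 RPM limit are not from Phil's note. The interface checks every argument and returns an error instead of dereferencing a bad pointer, and a small test throws exceptional values at it.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical interface, invented for this example. */
typedef enum { MOTOR_OK = 0, MOTOR_ERR_NULL, MOTOR_ERR_RANGE } motor_status_t;

typedef struct { int speed_rpm; } motor_t;

#define MOTOR_MAX_RPM 6000  /* assumed limit */

/* Checks every argument at the interface instead of trusting the caller. */
motor_status_t set_motor_speed(motor_t *m, int rpm)
{
    if (m == NULL)
        return MOTOR_ERR_NULL;
    if (rpm < 0 || rpm > MOTOR_MAX_RPM)
        return MOTOR_ERR_RANGE;
    m->speed_rpm = rpm;
    return MOTOR_OK;
}

/* Robustness test: throw exceptional values at the interface and insist
   that each one is rejected cleanly and leaves the state untouched. */
void robustness_test_motor(void)
{
    static const int bad_rpm[] = { -1, INT_MIN, MOTOR_MAX_RPM + 1, INT_MAX };
    motor_t m = { 0 };
    size_t i;

    /* A null handle must be rejected, not dereferenced. */
    assert(set_motor_speed(NULL, 100) == MOTOR_ERR_NULL);

    for (i = 0; i < sizeof bad_rpm / sizeof bad_rpm[0]; i++) {
        assert(set_motor_speed(&m, bad_rpm[i]) == MOTOR_ERR_RANGE);
        assert(m.speed_rpm == 0);
    }
}
```

The fixed table of bad values only shows the shape of the test. A fuzz test would generate such inputs randomly and in bulk, but the pass criterion stays the same: no crash, and a clean error return.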
Charles Manning replied: "This is a really interesting way to look at the differences. However, there is one way in which I think it falls down. You still consider the hardware and firmware/software designs to be decoupled.
"Rather than try to build margin into just the software, it is better to try to build margin into the whole system. For instance, sampling buttons with an ADC and using digital filtering takes you away from a binary view of buttons and allows for there to be some margin in the software.
"There is of course software margin when it comes to timing. If the code can only handle 1000 interrupts per second then using that to keep track of a peripheral issuing 990 interrupts per second does not leave much margin for timing problems. When I set up a new interrupt handling scheme I will try to max it out (e.g., toggle an interrupt pin from within the ISR, which fires the ISR again, and count the number of interrupts). I want to see interrupt processing rates of at least 5x my maximum expected rate to feel comfortable.
"Having worked with NAND flash for the last ten or so years has been rather interesting. NAND is about the only electronic componentry that is shipped with known bad regions and where you have to expect things to fail on you (without considering it a device failure). Software that works with NAND must be designed to cope with the failure.
"Writing software to cope with ESD bit-flips sounds a bit too paranoid. ESD should not have to be handled by software and is better handled in hardware (e.g., using CPUs and RAM with built-in ECC). If you're seeing ESD bit flips then there are very likely other things that cannot be trusted. If your stack can be corrupted, then what about the code itself? Once you get that paranoid you might as well give up."
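Charles's suggestion of sampling buttons with an ADC and filtering digitally might look something like the sketch below. It is one possible reading of the idea, not his implementation: the 10-bit converter, the two thresholds, the filter constant, and the names button_t and button_update() are all assumptions chosen for illustration. The gap between the press and release thresholds is where the margin lives.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* All constants are illustrative assumptions, not values from the letter. */
#define ADC_FULL_SCALE    1023u /* 10-bit converter */
#define PRESS_THRESHOLD    700u /* filtered value must rise above this to press */
#define RELEASE_THRESHOLD  300u /* and fall below this to release */
#define FILTER_SHIFT         2  /* low-pass: new = old + (sample - old) / 4 */

typedef struct {
    uint16_t filtered;  /* low-pass filtered ADC reading */
    bool     pressed;   /* debounced button state */
} button_t;

/* Feed one raw ADC sample; returns the debounced button state. */
bool button_update(button_t *b, uint16_t sample)
{
    int32_t delta;

    assert(b != NULL);
    if (sample > ADC_FULL_SCALE)
        sample = ADC_FULL_SCALE;            /* clamp junk readings */

    /* First-order IIR low-pass filter in integer math. */
    delta = (int32_t)sample - (int32_t)b->filtered;
    b->filtered = (uint16_t)((int32_t)b->filtered + delta / (1 << FILTER_SHIFT));

    /* Hysteresis: readings between the two thresholds change nothing. */
    if (!b->pressed && b->filtered > PRESS_THRESHOLD)
        b->pressed = true;
    else if (b->pressed && b->filtered < RELEASE_THRESHOLD)
        b->pressed = false;

    return b->pressed;
}
```

With these assumed numbers a single full-scale glitch moves the filtered value only a quarter of the way up, so it never registers as a press, and a pressed button that sags to mid-scale stays pressed.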
Just Reset It
In 2003 a Boeing 747-400 aircraft lost all engine and flight displays. Pilots flew on backup instruments for 45 minutes before ground technicians radioed back a fix. In 2001 the same type of aircraft experienced the same problem, which in this case wasn't repaired till the plane landed.
In both cases the fix was the same: cycle the circuit breakers. Punch reset, hit control-alt-delete, cycle power.
A software problem locked up the Pathfinder spacecraft's computer as it descended to a landing on Mars. The watchdog timer brought the system back to life. It just reset the CPU.
The Clementine lunar mapper dumped all of its fuel when the software ran amok. There was no watchdog. The mission was lost because ground controllers couldn't just reset it.
One reader wrote that his stove's oven fan went whacko; apparently computer-controlled, its health was restored when he cycled power. He just reset it. My 4-year-old niece offered a helpful suggestion while I was in the middle of resolving a LAN routing problem: "turn it off and turn it on, Uncle Jack; that always works for me." And we all know how to fix a Windows machine that's low on resources.
We just reset it.
I wrote a piece about watchdog timers (http://www.ganssle.com/watchdogs.htm). It averages 2000 downloads a month. Developers are apparently aching for a device to restore crashed systems back to life. Something that just resets it.
The culprit is often buggy code. The software revolution gave us tremendous functionality in the most mundane products. And it takes those capabilities away, randomly, usually at the most inconvenient moments. Sometimes, for inexplicable reasons, doing things in exactly the same manner we've used for months leads to a crash. But that's not a problem: cycle power or yank the batteries for 5 minutes. Just reset it.
We've created new life strategies to cope with the problems. My family calls me the "save it" czar, since anytime I notice anyone working on a document, spreadsheet, or other data manipulating tool, and the task bar is labeled "new document" instead of some filename that indicates at least one save took place, I slap them around. Metaphorically, of course. When using Word my left hand subconsciously rolls through the control-s save-mantra a couple of times a minute.
Old-timers will recall the debates that raged 30 years ago about reset switches. Were they needed? Desirable? Or perhaps a red flag to customers that even the vendor didn't trust the code they were shipping?
Today there are no reset switches, and products are huge, often employing hundreds of thousands of lines of code. Fact is, humans write this stuff. Humans are, last I checked, imperfect. Our work products always reflect ourselves, for better or for worse. Bugs abound. And that's not to mention other effects, like the increasingly common random bit flips from cosmic rays.
I think the next great change in firmware development will be self-recovering code, software that detects failures and initiates a graceful recovery transparent to the user. One approach is to carefully segment the code into tasks, each protected by an MMU, coupled with exception handlers smart enough to restart the flawed thread. Now that transistors are free why not stick an MMU even in a cheap 8 bitter?
But until then we'll use the same old technique.
We'll just reset it.
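The self-recovering approach described above, restarting only the flawed task rather than the whole system, can be sketched in a few lines of C. This is a toy under stated assumptions, not a real RTOS design: the names supervisor_run_once() and task_t, the restart budget, and the demo counter task are all hypothetical, and a step function returning false stands in for what would really be an MMU exception handler catching the fault.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_RESTARTS 3u  /* assumed budget before giving up on a task */

typedef void (*task_init_fn)(void *state);  /* puts a task in a known state */
typedef bool (*task_step_fn)(void *state);  /* returns false on a detected fault */

typedef struct {
    task_init_fn init;
    task_step_fn step;
    void        *state;
    uint32_t     restarts;
} task_t;

typedef enum { SUP_OK, SUP_TASK_RESTARTED, SUP_NEED_SYSTEM_RESET } sup_result_t;

/* Runs each task once. A faulting task is reinitialized on its own; only a
   task that keeps failing forces the old remedy of resetting everything. */
sup_result_t supervisor_run_once(task_t *tasks, size_t count)
{
    sup_result_t result = SUP_OK;
    size_t i;

    assert(tasks != NULL);
    for (i = 0; i < count; i++) {
        task_t *t = &tasks[i];
        if (t->step(t->state))
            continue;                       /* healthy */
        if (t->restarts >= MAX_RESTARTS)
            return SUP_NEED_SYSTEM_RESET;   /* budget spent: just reset it */
        t->restarts++;
        t->init(t->state);                  /* restart only the flawed task */
        result = SUP_TASK_RESTARTED;
    }
    return result;
}

/* Demo task: a counter that reports a fault when its value is out of range. */
typedef struct { int count; } counter_state_t;

void counter_init(void *s)
{
    ((counter_state_t *)s)->count = 0;
}

bool counter_step(void *s)
{
    counter_state_t *c = s;
    if (c->count < 0 || c->count >= 100)
        return false;                       /* sanity check failed */
    c->count++;
    return true;
}
```

The restart budget is the design choice worth noting: a task that fails over and over is probably not suffering a transient upset, so the sketch falls back to a full reset instead of looping forever.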
Let me know if you're hiring firmware or embedded designers. No recruiters please, and I reserve the right to edit ads to fit the format and intents of this newsletter. Please keep it to 100 words.
Joke for the Week
Note: These jokes are archived at www.ganssle.com/jokes.htm.
Dejan Durdenic sent this web site: http://www.ultracad.com/article_humor.htm . The story about flying electrons is priceless.
About The Embedded Muse
The Embedded Muse is a newsletter sent via email by Jack Ganssle. Send complaints, comments, and contributions to me at firstname.lastname@example.org.
The Embedded Muse is supported by The Ganssle Group, whose mission is to help embedded folks get better products to market faster. We offer seminars at your site offering hard-hitting ideas - and action - you can take now to improve firmware quality and decrease development time. Contact us at email@example.com for more information.