By Jack Ganssle
The Doc Scavenger
"What the hell does this mean?" yelled Scott, his usual complacency violated by the peculiar behavior of a brand-new VHF radio. I looked at the LCD screen and saw the hand-held was in some sort of diagnostic mode, displaying "err 4C" whenever he pressed the transmit button.
I knew he had agonized over the $300 purchase, his engineering attitude wanting to ensure the radio was the best possible value for the money. We were with our families well outside of the USA with no email access and phone service priced out of sight for support calls. The manual documented quite a few error messages, but strangely not this one.
Weeks later, back home, Scott started wending his weary way through the frustrations of telephone support. "Your call is very important to us, please hold" pales after 30, 40, 60 minutes of Muzak. Finally a technician responded. Clearly baffled, after calling in his supervisor, the tech announced this meant the radio was "broken". Under pressure he admitted that no one there really knew what the message meant, but it was unusual, so the radio required repair.
Sound familiar? A year earlier my autopilot stopped steering. It sounded three short beeps, paused, and then one more before repeating the pattern over and over. Clearly this was some sort of diagnostic code, one not listed in the manual, and one that tech support couldn't identify. "Send it in for repair" was the oh-so-standard response to all such unknown embedded systems behaviors.
Yet surely the firmware engineers went to a lot of trouble to put these exception conditions into the code. They mean something; but that meaning was lost due to poor documentation. Why bother detecting and reporting the condition when users will be baffled?
It's easy to tell the programmers to log everything they do, and to hope that the documentation and customer support folks will turn these notes into useful web FAQs and manual addendums. But that's absurd. Or at least naïve. Developers are busy enough trying to fulfill requirements and work out system bugs. Creating a support doc in parallel just ain't gonna happen, no matter how good of an idea it might be.
Maybe we need a documentation scavenger. Someone empowered to dig through the code looking for things that impact the customer. A person who bridges the development, support, and doc teams.
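Some of that scavenging might even be automated. What follows is only a sketch, and it assumes something the teams would have to agree on first: a hypothetical comment tag, `@user-error`, that developers drop next to every message a customer could see. The tag, the function names, the error codes, and the firmware fragment are all invented for illustration.

```python
import re

# Hypothetical convention: each user-visible error is tagged in the C source
# with a one-line comment of the form  /* @user-error <CODE>: <meaning> */
TAG = re.compile(r"/\*\s*@user-error\s+(\S+):\s*(.*?)\s*\*/")

def scavenge(source_text):
    """Return a sorted list of (code, meaning) pairs found in the source text."""
    return sorted(TAG.findall(source_text))

def to_doc_table(entries):
    """Format the entries as tab-separated lines for the doc team."""
    return "\n".join(f"{code}\t{meaning}" for code, meaning in entries)

# Illustrative firmware fragment; the codes and their meanings are made up.
SAMPLE = """
if (pll_unlocked) {
    show("E07");  /* @user-error E07: synthesizer failed to lock */
}
if (battery_mv < MIN_MV) {
    show("E12");  /* @user-error E12: battery too low to transmit */
}
"""

if __name__ == "__main__":
    print(to_doc_table(scavenge(SAMPLE)))
```

A script like this does not replace the human scavenger. It only works for the errors someone remembered to tag, but it does give support and documentation a list to start from.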
Or perhaps, if we're wise enough to be doing code inspections, one of the inspectors should always be a customer/documentation representative. If you're using eXtreme Programming, where a pair of developers share one computer, the person not typing could capture each of these new exceptions and ensure they're emailed to the doc team.
Some might insist that at requirements time we know all of the error codes and other intricacies of the system. In my experience, though, during coding we discover - and write code to trap - odd cases that should never happen, or that we'd hope would never happen. A three-level-deep IF can resolve into an awful lot of possible conditions, far more than we usually anticipate until we're deep into the unit design or code. That's when a lot of the error messages and exception handlers get written, and that's where they get lost to the customers.
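To put a rough number on that: if each level of a three-level IF makes a simple two-way test, and every branch has an else, there can be as many as 2^3 = 8 distinct outcomes, each a candidate for its own message. The sketch below just enumerates them. The test names are hypothetical, and real code with multi-way tests or more nesting produces more.

```python
from itertools import product

def leaf_conditions(names):
    """List every true/false combination of the named two-way tests."""
    return [
        dict(zip(names, values))
        for values in product((True, False), repeat=len(names))
    ]

# Hypothetical tests for a three-level nested IF; the names are illustrative.
TESTS = ("sensor_ok", "in_range", "converged")

if __name__ == "__main__":
    for combo in leaf_conditions(TESTS):
        print(combo)
```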
I do remember committing the same sin early in my career. Our instrument solved a very complex series of simultaneous polynomials using an iterative algorithm. (And this was on an 8008!) At times no solution was possible, so I'd programmed it to give up after 20 minutes and display "Help" on the seven-segment LEDs. Years later, long after the device was obsolete and most of the original staff had dispersed to other companies, a unit came back for repair. The young technician appeared at my office door, ashen. He had been adjusting internal pots when the machine started flashing "Help" messages. He thought some intelligence inside the box was reacting to his repair attempts.
What do you think? How can we turn the critical error messages created by our exception handlers into useful customer artifacts?