The Embedded Muse 439

The Embedded Muse
Issue Number 439, February 7, 2022
Copyright 2022 The Ganssle Group

Editor: Jack Ganssle, jack@ganssle.com

You may redistribute this newsletter for non-commercial purposes. For commercial use contact jack@ganssle.com. To subscribe or unsubscribe go here or drop Jack an email.

Contents

Editor's Notes
Quotes and Thoughts
Tools and Tips
Freebies and Discounts
Testing for Unexpected Errors
Design For Debugging Redux
Rewriting Code
Failure of the Week
Jobs!
Joke for the Week
About The Embedded Muse

Editor's Notes

Tip for sending me email: My email filters are super aggressive and I no longer look at the spam mailbox. If you include the phrase "embedded muse" in the subject line your email will wend its weighty way to me.

Jack's latest blog: Non Compos Mentis

Quotes and Thoughts

Excessive or irrational schedules are probably the single most destructive influence in all of software. Capers Jones

Tools and Tips

Please submit clever ideas or thoughts about tools, techniques and resources you love or hate. Here are the tool reviews submitted in the past.

In response to my review of SourceMonitor Sergio Caprile wrote:

I do some of my function evaluation in Linux, so I can run some unit tests with ceedling and work with no hardware (it mocks interfaces). There I also use cppcheck (https://cppcheck.sourceforge.io/) and I'm evaluating other alternatives like pmccabe (https://people.debian.org/~bame/pmccabe/overview.html) and metrix++ (https://metrixplusplus.github.io/metrixplusplus/docs/04-u-workflow/).

Freebies and Discounts

This month's giveaway is a 0-30 volt 5A lab power supply.

Enter via this link.

Testing for Unexpected Errors

A reader posed an interesting question: in a safety-critical system how do you test for unexpected errors?

Some errors are expected. A sensor going out of range, for instance. In the case of the 737 MAX crashes the angle-of-attack sensor went to an insane value in a second or less, which caused the code to do bad things. That is a testable condition. Of the hundreds of "Failures of the Week" that have run in the Muse, many stem from a crazy input that the code did not mitigate. That's sloppy engineering.

But other errors are impossible to anticipate. Cosmic rays can cause random bit flips. If the program counter is corrupted there's no predicting what will happen. A power supply brown-out will cause, well, pretty much any kind of unpredictable behavior. Dereference a null pointer and all sorts of craziness can happen.

(Note: Many software errors can be handled. Design by Contract stems from the Eiffel programming language, which pretty much nobody uses, but is available in Ada and SPARK. It essentially performs runtime checks to ensure values going into, and returned from, functions meet certain rules. It's one of the most valuable ways to ensure safe code. Alas, too few of us use DbC. One wonders if neglecting such a powerful tool is engineering malpractice.)
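C has no built-in contracts, but the idea can be approximated with macros. The following is a minimal sketch, not how Eiffel, Ada or SPARK implement it: `REQUIRE`, `ENSURE`, `contract_failed` and `adc_to_millivolts` are hypothetical names, and the 12-bit range and 3300 mV reference are assumed values chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define ADC_MAX_COUNTS 4095   /* assumed 12-bit converter */
#define VREF_MV        3300   /* assumed reference voltage */

/* Compile-time contract: the scaling below must fit in 32 bits. */
static_assert((int64_t)ADC_MAX_COUNTS * VREF_MV <= INT32_MAX,
              "scaling would overflow int32_t");

/* Hypothetical violation hook. A real system would log the fault and
   drive its outputs to a safe state; this sketch counts and reports. */
static unsigned contract_violations = 0;

static void contract_failed(const char *kind, const char *expr, int line)
{
    contract_violations++;
    fprintf(stderr, "%s violated: %s (line %d)\n", kind, expr, line);
}

/* Each check evaluates to 1 when the condition holds, 0 otherwise. */
#define REQUIRE(c) ((c) ? 1 : (contract_failed("precondition", #c, __LINE__), 0))
#define ENSURE(c)  ((c) ? 1 : (contract_failed("postcondition", #c, __LINE__), 0))

/* Convert ADC counts to millivolts. Returns -1 on a contract violation
   so the caller can take its own mitigating action. */
static int32_t adc_to_millivolts(int32_t counts)
{
    if (!REQUIRE(counts >= 0 && counts <= ADC_MAX_COUNTS)) {
        return -1;
    }

    int32_t mv = (counts * VREF_MV) / ADC_MAX_COUNTS;

    if (!ENSURE(mv >= 0 && mv <= VREF_MV)) {
        return -1;
    }
    return mv;
}
```

Routing failures through a hook rather than calling `abort()` is a design choice: in a safety-critical system the response to a broken contract is usually a controlled transition to a safe state, not a halt.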

The reader's question is important, but testing for correctness is getting things a little backwards. First, design a robust system.

We're in a golden age of microcontrollers where huge amounts of capability are available for little cost. All MCUs now have watchdog timers, the first line of defense against unexpected problems. Some offer window watchdogs which are a bit more of a pain to use, but are better than your average WDT.
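What makes a window watchdog stricter is that servicing it too early counts as a fault, just as servicing it too late does. Real parts enforce this in hardware. The sketch below is only a software model of the rule, and every name in it (`wwdt_t`, `wwdt_kick`, and so on) is illustrative rather than any vendor's API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Software model of a window watchdog. */
typedef struct {
    uint32_t window_open_ms;   /* a kick sooner than this is a fault  */
    uint32_t window_close_ms;  /* no kick by this time is a fault     */
    uint32_t last_kick_ms;     /* timestamp of the last valid kick    */
    bool     reset_requested;  /* latched once any fault is detected  */
} wwdt_t;

static void wwdt_init(wwdt_t *w, uint32_t open_ms, uint32_t close_ms,
                      uint32_t now_ms)
{
    assert(w != NULL && open_ms < close_ms);
    w->window_open_ms  = open_ms;
    w->window_close_ms = close_ms;
    w->last_kick_ms    = now_ms;
    w->reset_requested = false;
}

/* Called by the application to service the watchdog. */
static void wwdt_kick(wwdt_t *w, uint32_t now_ms)
{
    /* Unsigned subtraction stays correct across timer wraparound. */
    uint32_t elapsed = now_ms - w->last_kick_ms;

    if (elapsed < w->window_open_ms || elapsed > w->window_close_ms) {
        w->reset_requested = true;   /* too early or too late */
        return;
    }
    w->last_kick_ms = now_ms;
}

/* Called from a periodic tick to catch a kick that never arrives. */
static void wwdt_tick(wwdt_t *w, uint32_t now_ms)
{
    if ((now_ms - w->last_kick_ms) > w->window_close_ms) {
        w->reset_requested = true;
    }
}
```

The early-kick check is what catches code stuck in a tight loop that happens to include the watchdog service call, a failure an ordinary WDT will not notice.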

I had an epiphany when Intel introduced the 386 decades ago. Everyone hated that the 8086 used segment registers to extend a 16-bit address space to 20 bits. When the 386 appeared developers went mad. Instead of four segment registers now there were thousands. But that seemed brilliant to me. With the 386 one could build a system using an OS where each task lived in its own hardware-protected address space. If a task did something zany the memory management unit (MMU) would throw an exception.

A third of a century later we embedded people are still mostly stuck using 1980s hardware designs. Few MCUs have an MMU. Transistors are so cheap it makes no sense to not add such a powerful crash-mitigation asset to a processor.

In recent years Arm's memory protection unit (MPU) has made it into some (not enough) Cortex MCUs. The MPU is a poor person's MMU. It offers a handful of independent protected memory areas. Despite its limited capabilities I feel it's a great asset that every system that needs robustness should use.

Some RTOSes are MPU aware, which greatly simplifies its use.

A few vendors offer lockstep MCUs, where two identical processors operate simultaneously. Any difference in behavior throws an exception. TI is one, as well as Freescale (or Motorola or NXP or whatever their name is now).

While good architecture is critical, testing does remain important.

Back in the mainframe days we tested Fortran compilers by feeding a random deck of punched cards into the tool. It's amazing how often crashes occurred, but this did lead to incremental improvements in the compilers. (The University of Maryland's Ralph compiler would abort after 50 compile-time errors and print out a picture of Alfred E. Neuman, with the caption "This man never worries, but from the look of your code, you should.")
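The modern equivalent of the random card deck is to throw random bytes at any code that parses external input. Here is a minimal sketch that runs on a host PC. The frame layout, `parse_command` and `fuzz_parser` are hypothetical stand-ins for whatever your firmware actually parses, and the xorshift generator is used only so that a failing run can be reproduced from its seed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { CMD_OK, CMD_BAD_LENGTH, CMD_BAD_OPCODE, CMD_BAD_CHECKSUM } cmd_status_t;

/* Hypothetical unit under test.
   Frame layout: [opcode][payload length][payload...][checksum] */
static cmd_status_t parse_command(const uint8_t *buf, size_t n)
{
    if (n < 3u) {
        return CMD_BAD_LENGTH;
    }
    if ((size_t)buf[1] + 3u != n) {
        return CMD_BAD_LENGTH;      /* declared length disagrees with frame */
    }
    if (buf[0] > 0x0Fu) {
        return CMD_BAD_OPCODE;
    }
    uint8_t sum = 0;
    for (size_t i = 0; i < n - 1u; i++) {
        sum = (uint8_t)(sum + buf[i]);
    }
    return (sum == buf[n - 1u]) ? CMD_OK : CMD_BAD_CHECKSUM;
}

/* xorshift32: a small, repeatable pseudo-random generator. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Feed random frames of random length to the parser. Returns how many
   results fell outside the documented status range. */
static unsigned fuzz_parser(uint32_t seed, unsigned iterations)
{
    assert(seed != 0u);             /* xorshift32 never leaves state zero */

    uint8_t  frame[64];
    uint32_t rng = seed;
    unsigned bad = 0;

    for (unsigned i = 0; i < iterations; i++) {
        size_t n = xorshift32(&rng) % (sizeof frame + 1u);   /* 0..64 bytes */
        for (size_t j = 0; j < n; j++) {
            frame[j] = (uint8_t)xorshift32(&rng);
        }
        int s = (int)parse_command(frame, n);
        if (s < (int)CMD_OK || s > (int)CMD_BAD_CHECKSUM) {
            bad++;
        }
    }
    return bad;
}
```

The return value is the least interesting signal. The real payoff comes from running a loop like this with the compiler's address and undefined-behavior sanitizers enabled, so an out-of-bounds read shows up as a trap rather than a silent wrong answer.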

Clearly, we want to test for every possible input. But in truth this is liable to be superficial at best. With three 12-bit ADCs there are billions of possible input combinations. A dozen GPIOs means thousands of tests if one wants to check every possible condition. 100% testing is somewhere between intractable and impossible.
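The arithmetic is easy to check. Three 12-bit ADCs present 36 bits of input, or 2^36 combinations, and a dozen GPIOs give 2^12. The sketch below works through the numbers; the rate of 1000 tests per second is purely an assumed figure to give a sense of scale, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Number of distinct input vectors for a given count of input bits. */
static uint64_t input_combinations(unsigned total_bits)
{
    assert(total_bits < 64u);
    return (uint64_t)1 << total_bits;
}

/* Whole days needed to run every combination at a given test rate. */
static uint64_t exhaustive_test_days(uint64_t combinations,
                                     uint64_t tests_per_second)
{
    assert(tests_per_second > 0u);
    return combinations / tests_per_second / 86400u;
}

/* input_combinations(3 * 12) is 68,719,476,736 for the three ADCs.
   input_combinations(12) is 4,096 for the dozen GPIOs.
   At an assumed 1000 tests per second the ADC sweep alone takes
   exhaustive_test_days(68719476736, 1000) = 795 days. */
```

Combine the ADCs and the GPIOs and the input space is 2^48, and that is before considering the order or timing in which inputs change.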

One common issue is a problem in booting. There used to be a tool called the Poc-It which cycled power to a system. The target would assert a signal meaning "good boot" when it came up; the Poc-It monitored that and logged errors. Some users coupled that to a Variac to ramp the mains power through a range of values to simulate different international mains voltages and brown-outs. Alas, the Poc-It is no longer available, but it would not be hard to cobble up such a tool.
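The core of such a tool is a short loop: remove power, restore it, wait for the "good boot" signal, and log a failure if it never arrives. This sketch is not the Poc-It's design, which I have no details of. The `rig_io_t` interface is hypothetical, standing in for a relay driver, an input pin and a delay routine, and the `fake_*` functions are a simulated target (one that hangs on every fourth power-up) so the loop can be exercised on a PC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hardware interface for the test rig. */
typedef struct {
    void (*set_power)(bool on);     /* drives the relay feeding the target */
    bool (*good_boot)(void);        /* reads the target's "good boot" line */
    void (*delay_ms)(uint32_t ms);
} rig_io_t;

/* Cycle power `cycles` times; return how many boots failed. */
static unsigned run_power_cycles(const rig_io_t *io, unsigned cycles,
                                 uint32_t off_ms, uint32_t boot_timeout_ms)
{
    assert(io != NULL);
    unsigned failures = 0;

    for (unsigned c = 0; c < cycles; c++) {
        io->set_power(false);
        io->delay_ms(off_ms);       /* let the supply rails discharge */
        io->set_power(true);

        bool booted = false;
        for (uint32_t waited = 0; waited < boot_timeout_ms; waited++) {
            if (io->good_boot()) {
                booted = true;
                break;
            }
            io->delay_ms(1);
        }
        if (!booted) {
            failures++;             /* a real rig would log cycle number and time */
        }
    }
    io->set_power(false);
    return failures;
}

/* Simulated target: boots in 50 ms, but hangs on every 4th power-up. */
static bool     fake_powered      = false;
static unsigned fake_power_ups    = 0;
static uint32_t fake_ms_since_on  = 0;

static void fake_set_power(bool on)
{
    if (on && !fake_powered) {
        fake_power_ups++;
        fake_ms_since_on = 0;
    }
    fake_powered = on;
}

static void fake_delay_ms(uint32_t ms)
{
    if (fake_powered) {
        fake_ms_since_on += ms;
    }
}

static bool fake_good_boot(void)
{
    return fake_powered && (fake_power_ups % 4u != 0u) && fake_ms_since_on >= 50u;
}
```

On real hardware the off time and the boot timeout would be varied from run to run, since marginal boots often depend on how far the rails have decayed.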

Exception handlers are exceptionally difficult to test. Sure, one can simulate a divide by zero and see that an interrupt occurs. But how do you simulate that error at every possible divide in the code? And how do you ensure that the system responds safely?

Anticipating and handling errors is one of the most difficult problems we face. What is your approach?

Design For Debugging Redux

Responding to last issue's thoughts about designing for debugging, Stephen Morris-Jones wrote:

As a former freelance hardware designer, I felt that I could respond to the “Design for Debugging” item in “The Embedded Muse 438”.

Daniel McBrearty sent an email about things that should be obvious, but are too-often neglected:

3. LEDs. Please please HW guys - give me at least one LED. You can find the space. When other stuff doesn't work or you are in the bootloader with no debug - a simple LED can make such a difference ... oh and by the way, bring a bunch of those "spare" GPIOs out to something I can solder a wire to as well.

I have pushed for LEDs on most, if not all, of my embedded designs, typically a ‘power ON’ LED or a ‘heartbeat’ LED. The cost of an LED and associated resistor is usually a fraction of a cent, so even when manufacturing 100k/annum or more the cost is about 30 mins of a system integration engineer’s time. I added to the argument for the LED that it could become a non-fit as the design matures to save cost; that has never happened. There must be millions of Xerox machines out there now with these from 20+ years ago, as well as products from many other clients. I found the aerospace clients very reluctant to include LEDs on their embedded designs purely for system debug and integration.

I have taken over the design of many systems where the team had to get out DMMs or a ‘scope each time a problem occurred to determine if a board had power or was actually running. A simple flashing LED controlled by a processor will rapidly tell someone if power is applied and that the processor is doing something. It is not usually much extra work to vary flash rates to indicate some mode of operation.
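Varying the flash rate by mode takes only a few lines. The following is a minimal sketch, not Stephen's code: the modes, the half-period values and the function names are all hypothetical, and the function returns the LED state rather than touching a GPIO register so that it stays hardware independent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum { MODE_NORMAL, MODE_BOOTLOADER, MODE_FAULT, MODE_COUNT } sys_mode_t;

/* Half-period of the blink for each mode, in ms (assumed values). */
static const uint16_t half_period_ms[MODE_COUNT] = { 500u, 100u, 50u };

typedef struct {
    sys_mode_t mode;
    uint16_t   elapsed_ms;   /* time since the LED last toggled */
    bool       led_on;
} heartbeat_t;

static void heartbeat_set_mode(heartbeat_t *hb, sys_mode_t mode)
{
    assert((unsigned)mode < (unsigned)MODE_COUNT);
    hb->mode       = mode;
    hb->elapsed_ms = 0;
}

/* Call once per millisecond, for example from a timer interrupt.
   Returns the state the caller should write to the LED pin. */
static bool heartbeat_tick_1ms(heartbeat_t *hb)
{
    hb->elapsed_ms++;
    if (hb->elapsed_ms >= half_period_ms[hb->mode]) {
        hb->elapsed_ms = 0;
        hb->led_on = !hb->led_on;
    }
    return hb->led_on;
}
```

With these values the LED blinks at 1 Hz in normal operation, 5 Hz in the bootloader and 10 Hz on a fault, which is easy to tell apart by eye.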

Jerry Mulchin also had some thoughts:

With regard to Daniel McBrearty's piece on "Design for Debugging", I cannot overemphasize the importance of what he says.

Back in the day when I was a partner in a small company called InterContinental Micro Systems, an S100 bus company, I always insisted on having at the very least a serial port to use for debugging.

And if I could get an LED, the more the better. Back then having an In-Circuit-Emulator or Logic Analyzer was a luxury and very expensive. After 40 plus years of Hardware/Software designs, I still make sure I have these 2 items on any design I create. And I generally dedicate an LED to a timer interrupt as a "Heartbeat" indicator.

That alone tells you a lot about the state of a hardware design. But the serial port is probably the most indispensable of all. Messages sprinkled around in the code with conditional Debug statements can really aid in diagnosis of a problem. Today's microcontrollers usually come with at least one serial port, if not more, so the basic tool is there for everyone to use. The cost is an RS-232 converter and a DB-9 port. Serial Terminal programs abound on every platform used today for programming.
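Conditional debug statements of the kind Jerry describes are commonly built around a macro that compiles to nothing in release builds. This is a minimal sketch under stated assumptions: `DEBUG_ENABLED`, `DBG`, `debug_printf` and `uart_write` are hypothetical names, and `uart_write` here captures text in a RAM buffer as a stand-in for a real UART transmit routine so the example runs on a PC.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#ifndef DEBUG_ENABLED
#define DEBUG_ENABLED 1      /* build with -DDEBUG_ENABLED=0 for release */
#endif

/* Stand-in for the serial port: output is captured in RAM. */
static char   uart_capture[256];
static size_t uart_len = 0;

static void uart_write(const char *s)
{
    size_t room = sizeof uart_capture - 1u - uart_len;
    size_t n = strlen(s);
    if (n > room) {
        n = room;            /* drop what does not fit instead of overflowing */
    }
    memcpy(&uart_capture[uart_len], s, n);
    uart_len += n;
    uart_capture[uart_len] = '\0';
}

/* Format one message, prefixed with its source location. */
static void debug_printf(const char *file, int line, const char *fmt, ...)
{
    char msg[96];
    int used = snprintf(msg, sizeof msg, "[%s:%d] ", file, line);
    assert(used >= 0);

    if ((size_t)used < sizeof msg) {
        va_list ap;
        va_start(ap, fmt);
        vsnprintf(msg + used, sizeof msg - (size_t)used, fmt, ap);
        va_end(ap);
    }
    uart_write(msg);
}

#if DEBUG_ENABLED
#define DBG(...) debug_printf(__FILE__, __LINE__, __VA_ARGS__)
#else
#define DBG(...) ((void)0)   /* no code or strings left in the image */
#endif
```

A call such as `DBG("adc=%d\r\n", reading);` then costs nothing when debugging is switched off. Bear in mind that formatted printing is slow and stack hungry on small MCUs, so calls inside interrupt handlers deserve care.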

Rewriting Code

And Steve Wheeler had some stories about rewriting others' code:

With regard to Dave Zar’s comment about rewriting the code of others, I have two anecdotes.

More than a decade ago, I had to rewrite code someone else had written to implement a ring buffer. You would think that a ring buffer is simple enough that little could go wrong; there are only a couple of boundary conditions, and you really only need memory for the buffer and a pair of pointers. With a fixed-size buffer, the major design choice you have is whether to drop the oldest or the newest character when the buffer overflows.

Well, this code had been written by someone who was very proud of his master’s degree, since neither I nor our boss had one, and he wanted to show off how much better his code was. His ring buffer code, which he proudly proclaimed as rock-solid, turned out to have a bug. Once the buffer filled, the first overflow character would put it into a persistent error state where incoming characters were inserted twice, losing the two oldest characters in the buffer for each new incoming character. The only way to recover was to reinitialize the serial port.

I got to rewrite his code. He had written about 600 lines of C that implemented the ring buffer as a state machine with 64 states. There were very few comments, mostly about the encoding of the state values. So much for a robust implementation. I replaced it with a more traditional implementation that required a lot less code.
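For comparison, a conventional fixed-size ring buffer fits in a few dozen lines. This is a minimal sketch and not the code from the anecdote: all names are hypothetical, the 8-byte size is arbitrary, and a `count` field is kept alongside the two indices to make the full and empty cases easy to tell apart. The overflow policy is the one design choice mentioned above, selected at initialization.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RB_SIZE 8u           /* arbitrary size for illustration */

typedef enum { RB_DROP_NEWEST, RB_DROP_OLDEST } rb_policy_t;

typedef struct {
    uint8_t     data[RB_SIZE];
    size_t      head;        /* index of the next write */
    size_t      tail;        /* index of the next read  */
    size_t      count;       /* bytes currently stored  */
    rb_policy_t policy;      /* what to discard on overflow */
} ring_t;

static void rb_init(ring_t *rb, rb_policy_t policy)
{
    assert(rb != NULL);
    rb->head = 0;
    rb->tail = 0;
    rb->count = 0;
    rb->policy = policy;
}

/* Store one byte. Returns false if any byte was lost to overflow. */
static bool rb_put(ring_t *rb, uint8_t byte)
{
    bool lost = false;

    if (rb->count == RB_SIZE) {
        if (rb->policy == RB_DROP_NEWEST) {
            return false;                        /* incoming byte discarded */
        }
        rb->tail = (rb->tail + 1u) % RB_SIZE;    /* discard the oldest byte */
        rb->count--;
        lost = true;
    }
    rb->data[rb->head] = byte;
    rb->head = (rb->head + 1u) % RB_SIZE;
    rb->count++;
    return !lost;
}

/* Fetch the oldest byte. Returns false if the buffer is empty. */
static bool rb_get(ring_t *rb, uint8_t *out)
{
    if (rb->count == 0u) {
        return false;
    }
    *out = rb->data[rb->tail];
    rb->tail = (rb->tail + 1u) % RB_SIZE;
    rb->count--;
    return true;
}
```

This version is not interrupt safe as written. If `rb_put` runs in a receive ISR and `rb_get` in the main loop, the shared `count` needs protection, or the buffer should be restructured so each side writes only its own index.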

Sometimes, code should be rewritten, regardless of whether it has bugs. At an earlier job, I was tasked to fix a bug in a piece of code that had been originally written by someone who had transferred to another division. I was the second person (after the original programmer) who had to go in and fix a bug in that particular piece of code. The code was highly optimized and very convoluted, and it took me about three weeks to understand it well enough to be able to make the fix. I talked with the woman who had fixed the prior bug, and she told me that it had taken the original programmer about three weeks to get the code that optimized, and that she also had needed about three weeks to understand it well enough to fix the bug she had worked on.

Compared to a simple and straightforward implementation, this highly-optimized code saved several milliseconds every time it executed … which was once during startup. The effort to optimize it was totally unnecessary and imposed unnecessary costs on the company. Knuth’s comment about premature optimization being the root of all evil applies. Unfortunately, the company would not let the code be rewritten. Fixing bugs could be justified, but rewriting “working code” was an unnecessary expense.

Failure of the Week

In honor of the New England blizzard last week, Marinna Martini sent the following. I expect this is not a failure per se; no doubt bulbs are burned out or there was operator error. Still, it's priceless:

Josh Weeks sent this:

Have you submitted a Failure of the Week? I'm getting a ton of these and yours was added to the queue.

Jobs!

Let me know if you’re hiring embedded engineers. No recruiters please, and I reserve the right to edit ads to fit the format and intent of this newsletter. Please keep it to 100 words. There is no charge for a job ad.

Joke For The Week

These jokes are archived here.

From Rick Ilowite:

Q: Would you like some bouillabaisse₂?
A: Yes, but just a bit.

About The Embedded Muse

The Embedded Muse is Jack Ganssle's newsletter. Send complaints, comments, and contributions to me at jack@ganssle.com.

The Embedded Muse is supported by The Ganssle Group, whose mission is to help embedded folks get better products to market faster.