Happy New Year to you and yours!
2019 marks the 22nd year of publication of The Embedded Muse (all 365 back issues are here). Thanks so much for your dialog, ideas and thoughts over the years. I truly enjoy the give and take, and Muse readers benefit from the collected wisdom of your thoughts.
Traditionally the New Year holiday is a time for some reflection and a time when many make themselves promises about changes they'd like to make. Me, I'm not making any commitments to going to the gym (I don't think I've ever been to one), or eating better, etc. My goal is to learn more about many things. Last year I studied microbiology, and have gotten just a tiny bit of a layman's understanding of this fascinating subject. I just finished reading a college textbook on circuit theory, to see what I remembered from those EE classes oh-so-long ago. While the material is not particularly difficult for an experienced engineer, I found a new admiration for the students willing to slog through this, while their non-STEM friends are partying parental profits away.
When I ran engineering teams we used the new year as a chance to reflect on our practices and needs. Were our tools adequate? Did we need to get smart about some new technology? Were there ways we could improve our methods and processes?
How about your team? Are you satisfied with the group's effectiveness, productivity and quality? Have you benchmarked the team against industry standards? If not, check out my Better Firmware Faster seminar.
Latest blog: Medical Device Lawsuits.
A lot of readers responded to the last issue's article about watchdogs. Here are some of the highlights:
John Carter wrote:
I agree wholeheartedly with you that watchdog timers are required....
But over the years I've been mentally cataloguing interesting ways in which people have got them wrong.
- Watchdog timers war with debuggability. Logging or a debugger session takes too long.... the watchdog pulls the rug out and everybody gets hopelessly confused. So devs switch the watchdog off or disconnect it so they can debug..... Now what, exactly, guarantees it gets switched back on / reconnected before you ship? Oops.
- A watchdog timer on a perfectly working system never barks, right? So if everything is ticking along flawlessly.... But watchdogs have bugs too; always have an explicit test for the watchdog itself, and run that test.
- When the watchdog timer pulls the hard reset.... Why? What was wrong? How do you even begin to debug it? You can't intercept a hard reset. Answer: get it to yap a little early and have a tiny, simple routine record whatever debug info it can... and then reset hard. If the early warning freezes too, big brother barks and resets hard anyway.
- Everything is ticking along, the watchdog is being patted, but the system is barking mad and doing nothing useful. Answer: at the point where the system has successfully earned its keep, _that_ is where you need to put the watchdog pat, not on a vacuous timer nobody cares about.
- In multi-subsystem designs, if the subsystems maintain shared state, resetting one of them won't help. You have to reset every subsystem that maintains that shared state and bring them up in the correct order. Systemd gets a lot of flak, but this is one of the good things it does for you.
- Keep track of how often resets are happening. I have seen a system that behaved in a flaky manner.... that was because it was resetting very often, but nobody was looking!
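John's fourth point lends itself to a simple pattern. Here's a minimal, host-testable sketch (the task names and function names are mine, not from any particular RTOS): each task sets a liveness flag only after completing real, useful work, and a supervisor pats the hardware watchdog only when every flag is set.

```c
#include <stdbool.h>

/* Hypothetical register write; on real hardware this would hit the WDT. */
static unsigned pats;
static void wdt_pat_hw(void) { pats++; }

/* One liveness flag per task; each task sets its flag only after it
   has actually produced a useful result, never from a timer tick. */
enum { TASK_SENSOR, TASK_CONTROL, TASK_COMMS, NUM_TASKS };
static volatile bool task_alive[NUM_TASKS];

void task_report_progress(int task) { task_alive[task] = true; }

/* Called periodically; pats the hardware WDT only when every task has
   made real progress since the last check.  Returns whether it patted. */
bool wdt_supervisor_poll(void)
{
    for (int i = 0; i < NUM_TASKS; i++)
        if (!task_alive[i])
            return false;          /* someone stalled: let the WDT bite */
    for (int i = 0; i < NUM_TASKS; i++)
        task_alive[i] = false;     /* re-arm for the next interval */
    wdt_pat_hw();
    return true;
}
```

The point is that task_report_progress() is called at the end of useful work; a task that is "running" but accomplishing nothing stops patting the dog.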
John's item 3 is important. I have seen systems that merely assert the non-maskable interrupt (NMI) when the watchdog fires. The thinking is that the NMI handler can then log debugging breadcrumbs. But that's dangerous. On some processors, if the stack pointer has become an odd number, stacking the exception frame generates a bus fault; a bus fault while servicing the NMI means a double bus fault, and the only way out of that is a hard reset. Also, if the watchdog goes off due to a cosmic ray or EMI, the MCU's little brain could be completely scrambled, and only a reset will guarantee the processor comes back to life.
An interesting alternative is to hit NMI on a watchdog timeout as, with luck, it might be possible to log what happened, but then have an external timer fire a ms or so later, no matter what, to assert the CPU's reset line. Most likely you'll get some decent debugging info, and the system will come back to life.
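A sketch of that breadcrumb idea, with hypothetical names throughout. The record would live in RAM that survives reset (on a real part, a linker section marked noinit; here it's an ordinary static so the logic can be exercised on a host). The magic word is written last so a partially written record is never mistaken for a valid one:

```c
#include <stdint.h>
#include <stdbool.h>

/* Crash breadcrumbs kept in a RAM region that survives reset. */
#define CRASH_MAGIC 0xDEADBEEFu
struct crash_record {
    uint32_t magic;
    uint32_t pc;        /* faulting program counter, from the stacked frame */
    uint32_t task_id;   /* whatever the scheduler says was running */
} crash_log;

/* Early-warning handler (e.g. the WDT's NMI): log, then wait for the
   external timer to yank the reset line a millisecond later. */
void wdt_prewarn(uint32_t stacked_pc, uint32_t task_id)
{
    crash_log.pc = stacked_pc;
    crash_log.task_id = task_id;
    crash_log.magic = CRASH_MAGIC;  /* written last: record is now valid */
    /* ...on hardware: spin here until the external reset fires. */
}

/* At boot, see whether the previous run died with its boots on. */
bool crash_record_valid(void)
{
    return crash_log.magic == CRASH_MAGIC;
}
```

At start-up, firmware checks crash_record_valid(), reports the breadcrumbs, then clears the magic word so a clean boot isn't misreported.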
George Farmer's email resonated:
When I was in aerospace we had a mantra: If it's on the same die, it won't fly.
Among numerous cardinal design sins this mantra was intended to avoid, one of the most obvious was the use of the microprocessor's internal WDT. For very good reasons we were strictly forbidden from relying on this feature and absolutely had to design an external WDT instead. Also, we could never use an epoch-based (AKA event-based) watchdog - as these were WAY too easy to defeat - instead relying on a signature-based watchdog. Lastly, the WDT had to perform a hardware reset of the whole system, not simply an NMI of the micro.
The signature was both pattern- and temporally-based, meaning a series of bytes had to be written to the WDT in a certain order, and each had to fall within a specific window. In essence, this combined the features of your 68332 and TI/MAX chip examples. So, if we had, say, seven safety-critical tasks that had to be executed in a specific order, each within a certain amount of time, at the end of each task a unique byte value was sent to the WDT by that task, tickling the watchdog. In essence, each task punched its own ticket (each byte had to be unique to each task). After the signature was complete the WDT would clear all ticket registers and start over.
If a task overran for whatever reason and failed to complete, or to complete in time, the WDT reset the entire system but did not clear the ticket registers. The WDT would set a port bit that the micro could then read upon reboot, indicating a timeout had occurred. For debugging purposes the signature (or ticket registers) could be read back by the micro, providing valuable information on which task failed.
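George's signature watchdog can be modeled in software to show the mechanics. This is a sketch with invented byte values and deadlines, simplified to three tickets and an upper deadline only (his real windows presumably had lower bounds as well):

```c
#include <stdint.h>
#include <stdbool.h>

/* Software model of a signature WDT: tasks must each punch a unique
   ticket, in order, each before its deadline within the epoch. */
#define NUM_TICKETS 3            /* 3 for brevity; the original used 7 */

static const uint8_t  expected[NUM_TICKETS] = { 0xA1, 0xB2, 0xC3 };
static const uint32_t deadline[NUM_TICKETS] = { 10, 25, 40 };  /* ticks */

static unsigned next_slot;       /* which ticket we expect next */
static uint32_t epoch_start;     /* tick count when the sequence began */
static bool timed_out;           /* latched status bit read at boot */

void wdt_restart(uint32_t now) { next_slot = 0; epoch_start = now; }

/* Returns false (and latches the fault flag) on a wrong byte, a byte
   out of order, or a byte past its deadline -- any of which would pull
   a hardware reset on the real part. */
bool wdt_punch(uint8_t ticket, uint32_t now)
{
    if (next_slot >= NUM_TICKETS ||
        ticket != expected[next_slot] ||
        (now - epoch_start) > deadline[next_slot]) {
        timed_out = true;
        return false;
    }
    if (++next_slot == NUM_TICKETS)
        wdt_restart(now);        /* full signature: clear and start over */
    return true;
}

bool wdt_timed_out(void) { return timed_out; }
```

Note that the fault flag survives the (modeled) reset, matching George's description of a port bit the micro reads back on reboot.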
In a basic embedded systems class I used to teach, I demonstrated just how easy it was to defeat a micro's internal watchdog by exposing an 8751 to the overhead fluorescent lights (basically just left the EPROM window uncovered). The demo was simple: Cycle through the port lines, flashing a bunch of LEDs in a well-defined pattern, all the while with the internal WDT enabled. When the erasure window was covered, all was well and good. Upon removing the opaque cover, within a short period of time the micro would latch up in a purely random state. I had built in a 3-second delay after boot-up with only one rapidly-flashing LED before the general LED sequence began flashing so the audience could clearly see what was going on. Typically the WDT did its job and reset the micro just fine - a few times. But within a very short period of time after exposure to the overhead lights the entire micro would simply quit - its internal watchdog failed to pull the micro out of its funk.
Another design sin we avoided at all costs was reliance on the micro's default IO states after a hardware reset. Without exception, each and every IO line and port function had to be explicitly set up in software as soon as possible after a hardware reset. All too often I cringed at programmers who would state they simply didn't bother programming a particular port because, say, it defaulted to all inputs after a hardware reset. Yikes! SCARY!!
Finally, in my 34+ years in embedded systems design, and after both reading and investigating countless system failures, the single biggest danger I consistently see in the embedded world is hubris. At my previous employer I was shocked by the flippant treatment of safety-critical applications - especially with respect to software design and testing practices. Indeed, we had an engineering director who bragged about how he once wrote over 40K lines of code for his graduate thesis project and "... it worked perfectly the first time." He therefore could not see how rigorous software design processes could be justified. Faster, Better, Cheaper (pick any two). Sadly, culture starts at the top, and his arrogance trickled down to his direct reports. Needless to say, my respect for the guy waned pretty quickly.
Tom Mazowiesky had an interesting point:
Regarding watchdogs, I do use them, but I only add the functionality late in the development cycle; otherwise people tend to use it as a crutch ("it's OK, the watchdog triggered and brought us back to life"). Saving it until late in the program forces us to clean up all those nasty little problems. I think even if you're very disciplined in development, it's still possible to create an unrecoverable situation, whether hardware- or software-caused. In a very old project (early 1980s), a high-speed printer with three CPUs, we had a reset button on the development unit's control panel, and the boss noticed that we used it fairly frequently. So one day he said, "You know the customer won't have that." That comment made us go back and find all the problems that caused the unit to lock up.
We do bill validators, and a single customer reported a very intermittent problem where every once in a while the validator would go completely offline with one of the banks of scanning LEDs on. It wasn't during note scan; the unit was in IDLE mode (where the LEDs are never activated) and was completely frozen. The stuck LEDs were bad because with them on continuously, heat in the unit would rise and possibly damage some of the plastic parts. We never did find the real cause, but at that point we activated the watchdog timer in the code, sent it to that customer, and gradually upgraded all our other customers as well. The problem has never occurred again.
So I think they are very useful, as long as you don't depend on them for everyday stuff, but for real disasters.
Mike Lease had three bullet points:
Interesting discussion on watchdogs. My 2 cents on the subject:
- I worked with a micro back in the '90s whose watchdog timer only forced the program counter to 0. It didn't actually perform a hardware reset, so if an on-chip peripheral needed to be reset, it wouldn't be. That's probably not an issue with newer micros, but it's something to check before relying on one.
- On a more recent project, the customer insisted on using an external watchdog circuit as a backup to the internal watchdog. Their concern was less about a firmware hang than about a power glitch that might prevent the internal watchdog timer from working. Their product was used in very remote places, and only specially qualified people had access to it, so a service call could take weeks to happen and cost thousands of dollars. In that case the belt-and-suspenders approach is easily justified.
- With the proliferation of micros inside peripherals (and multiple micros on a board) it may also make sense to use an external watchdog to ensure all the micros in the system are reset to keep them from getting out of sync.
Personally, I think using the micro's watchdog should be the minimum level of protection for most devices. An engineer really needs to consider what they are trying to protect the system from, and the level of protection that is appropriate for the function their particular product performs.
Dave Telling's products don't always allow a watchdog:
Your latest Muse had a question about whether or not we use watchdogs.
In my designs, watchdogs are a "sometimes" thing, and here's why: most of the designs I did for ignition systems lived in an environment with massive local EMI from firing spark plugs. Dealing with spark-noise induced system failures was a PITA, and I used a lot of filtering (both physical and firmware) to reduce false triggering and similar problems.
However, one thing that proved very difficult to deal with was noise coupled into the crystal (or ceramic resonator) lines for clock generation. Spikes here would cause the uC to lock up, and about the only thing that seemed to work was to use the internal RC clock, if possible, to avoid the issue. That said, even then we would occasionally have system lockups when testing in a deliberately severe EMI environment. But I chose not to use a watchdog, because if the system stopped making sparks, the exhaust would load up with unburned fuel, and if the watchdog resumed making sparks, there would be a horrendous explosion in the exhaust, which could (and did) blow mufflers apart and scare the you-know-what out of the people in the car! So we decided it was better to make the system as robust as we could, and if it failed, the operator would have to manually cycle power to restart the system.
On other products, we could use the watchdog to reset if a weird condition occurred.
Here's Rod Main's take:
In an ideal world code would have no bugs. The board would be hardened against EMP and cosmic ray impacts and people would only use your device for what it was intended and in the way in which it was intended. And pigs would fly...
In reality, systems are never that simple. They tend to have optional ways of working depending on hardware inputs or reconfigurable parameters. Mostly, as you point out, to deal with the fact that requirements are vague and misinterpreted. Followed up by the ever-popular "We used it for [some process you'd never heard of before] and it crashed. We thought it would just work."
Should I be concerned that some people think code can be made crash-proof? Of course we want our code to have infinite uptime. We don't want memory leaks. We want to write "great code". But the prudent programmer is also pragmatic. If you can't foresee every eventuality (and if I could, I'd be winning the lottery every week), then you need to deal with the situation where perfection hasn't been achieved. If it is possible for the software to get into an infinite loop - maybe due to some interaction with its hardware inputs hitting just that "magic" timing moment - then how do you get out of it?
A hardware watchdog may be the last line of defense. Hopefully, it just causes an interrupt where you can save the context of where the application was, putting it somewhere for later extraction and analysis and giving you the chance to plug another hole. Even if it just resets the processor, at least you get to start afresh.
Many years ago we realised that if our watchdog did kick in, everything would reset and our controller would restart as if from power-off. Readings would be zeroed, controls would be set to off and everything would stop. You can imagine several cases where this would be... let's say... unhelpful. We modified our code so that every start-up checks to see if the hardware we are controlling is actually running. If it is, our "initialisation" reads all the current states and resynchronises with the running system. The result is that our system can reset without the customer being aware that there has been a problem.
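Rod's warm-restart idea can be sketched like this (all names and the setpoint notion are hypothetical; a real controller would read sensors and actuator feedback rather than take a struct). Instead of blindly zeroing everything at boot, the init code asks whether the plant is already running and, if so, adopts its live state and latches a flag for the site engineers:

```c
#include <stdbool.h>
#include <stdint.h>

/* Snapshot of the plant as observed at boot. */
struct plant_state {
    bool running;        /* e.g. actuator feedback says "on" */
    uint32_t setpoint;   /* the live operating setpoint */
};

static uint32_t setpoint;
static bool warm_started;   /* logged so site engineers can see it */

/* Boot-time init: after a watchdog reset, resynchronise with hardware
   that is already running instead of shutting everything down. */
void controller_init(const struct plant_state *plant)
{
    if (plant->running) {
        setpoint = plant->setpoint;  /* adopt live state; don't stop the line */
        warm_started = true;
    } else {
        setpoint = 0;                /* genuine cold start */
        warm_started = false;
    }
}

bool boot_was_warm(void) { return warm_started; }
uint32_t current_setpoint(void) { return setpoint; }
```

The warm_started flag is what drives the warning message Rod mentions next: the customer sees nothing, but the engineers know a reset happened.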
We did, however, add a warning message so that our site engineers can see that something has happened. We know the situation only happens rarely but where hundreds of thousands of dollars of product could be lost if the control system stopped at the wrong time, doing something sensible is much better than not doing it. Thanks to our watchdog.
Meanwhile, our rivals, whose software engineers have written "Great code" which is crash proof...
Not a new product, but notable: Renesas's Synergy Software Package (SSP) is a completely free set of components designed to get your products to market much faster, assuming you're using one of their Synergy processors (various flavors of Cortex-M MCUs).
That's sure to elicit a giant yawn from experienced developers, many of whom have been burned by buggy or superficial support software provided by semiconductor vendors.
But I think this is different. The SSP appears to be very complete, with HAL drivers for most if not all of the peripherals on their MCUs and gobs of middleware. What I find most interesting is their focus on quality. The quality handbook and quality summary document exactly what policies and procedures were used on each component. In fact, I would recommend these processes for any firmware team trying to build verifiably great code. And the SSP comes with a warranty (though I have not been able to find the warranty's terms; the Synergy website is, in some cases, a bit muddy).
But wait, there's more. Also included is Express Logic's entire (near as I can tell) software suite, including the ThreadX RTOS with its optional filesystem, GUI tools, networking, and more. Need development tools? IAR's suite is provided along with their runtime analysis tools. Prefer to use Eclipse? That is supported.
So, how free is free? The website indicates these are all totally free, with no licensing or royalty fees (access to the source code may incur a cost; it's not entirely clear from the site). Disbelieving this, I wrote to Renesas and they emailed me the following (bulleted points are my questions):
- It appears the Synergy software is free. Is this really true?
Yes. All software and tools are free so long as you use a Synergy MCU.
- If it includes the Express Logic components, can it really be free? I don't see how this works for them.
Renesas has negotiated a special licensing contract with Express Logic, so long as ThreadX and the Express Logic components are used on Synergy MCUs. The follow-on question customers normally ask is, "Well, nothing is free; you are overcharging for the MCU." Actually, this isn't the case either. The predominant cost of an MCU is the die itself, specifically the code flash and SRAM, of which Synergy devices have a lot. The licensing cost is well under 1% of the overall cost, so it is negligible.
- Does it include the IAR tools? Again, are these then free?
Yes, again this is included for free as well. IAR Embedded Workbench for Synergy can be used on any Synergy device. The license is not code-size limited nor time restricted. An engineering team can download as many copies of IAR EW as they need for their product development. Also included for free is C-STAT and C-RUN for code analysis, which is normally a separate purchase from IAR.
After getting that email I ordered a $35 Synergy dev board and will give it a whirl.
Note: This section is about something I personally find cool, interesting or important and want to pass along to readers. It is not influenced by vendors.