A ton of people responded to my comments about adding margin to firmware last issue. Here are a few samples.
Scott Winder wrote:
How can we add design margin to code?
The short answer: redundancy.
The longer answer could occupy volumes, but I'll try to be reasonably brief. In the automotive industry (where my familiarity lies), ISO 26262 was devised to address the issues of margin and reliability in the functional safety context (the standard is based largely on IEC 61508, which provides similar guidelines for the industrial sector). The precepts of the standard can be applied equally to non-safety-critical applications, but in general they aren't (at least, not fully), due to the high costs associated with the added hardware, not to mention the additional effort in the planning, development and testing stages.
This brings us to a critical point: adding enough margin to make a key module safe requires a full system solution. There are things that can be done in software to ameliorate the effects of errors (you mentioned exception handling), and there are certain architectures that are more tolerant of faults (for instance, a thoroughly vetted server that will continue doing what it’s supposed to do, even if the connecting client provides invalid instructions, or disappears). In the case of a more catastrophic failure, however (be it a crashed hard drive, bad RAM, corrupted flash, cosmic rays, or a weak solder joint), redundancy can only be ensured with additional (or specialized) hardware.
In the interest of detecting and handling internal (silicon-level) errors, several manufacturers now produce LSDC (lock-step dual core) microcontrollers; these devices feature a high level of internal redundancy and error detection mechanisms, and the code is ultimately executed identically on both cores. Post-execution, the results are compared—if they differ, a critical error is thrown and the system moves into its fail-safe state. Note that this also guards against many problems that may happen in production, because impurities or other silicon problems are unlikely to affect redundant systems identically.
Increasingly, car manufacturers are exploring the possibility of fail-operational systems, where a critical error in, e.g., electrically-assisted power steering will not result in a total reversion to unassisted control (this is a much bigger concern with the advent of steer-by-wire). The current implementation of fail-operational requires multiple (at least two) processors and heavily redundant circuitry to ensure continued operation in the face of one or more failed components. This helps to minimize the hardware portion of the failure risk in the system.
Moving back to software: no amount of additional hardware will make up for—for example—inadequate input handling or an unaccounted-for FSM state. If you have two processors running the same code with the same errors, you'll have the same problem at the end. The solution is to solve the problem in parallel but different ways. At the extreme, this involves two (or more) different teams implementing different algorithms on different hardware architectures, which are then combined into a single system. This approach can be scaled up (by adding additional parallel paths—three or more allows decision-making via a “voting” method) or down (by removing some of the redundancy). In the case of a system with a single processor and no hardware redundancy, using different approaches to solve the same problem in two different processes or threads will still provide some protection against the failure of one or the other, even if that protection only consists of the detection of errors that might otherwise have gone unnoticed.
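The "voting" method Scott mentions can be sketched in a few lines of C. This is a minimal 2-out-of-3 majority voter; the function name and types are illustrative, not from any particular standard:

```c
#include <stdbool.h>

/* Hypothetical 2-out-of-3 voter: three independently computed results
 * (ideally from diverse implementations) are compared, and any value
 * that at least two paths agree on wins. If all three differ there is
 * no majority, and the system should enter its fail-safe state. */
bool vote3(int a, int b, int c, int *out)
{
    if (a == b || a == c) { *out = a; return true; }
    if (b == c)           { *out = b; return true; }
    return false;   /* no majority: critical fault */
}
```

In a real diverse-redundancy design, `a`, `b`, and `c` would come from independently developed implementations, and the `false` return would trigger the fail-safe transition.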
A final point that merits attention is that the planning and testing processes are critical to the success of a highly-reliable system. I won’t dwell much on testing, other than to say that if most of the tests aren't defined during the planning stage, the planning is probably inadequate. Beginning in the early design stages, a well-planned system will go through a process called FMEA (Failure Mode and Effects Analysis) or one of its relatives (FMECA, FMEDA). During this process, weak points in the system will be identified; these points will be assigned a probability and severity in order to determine an overall risk, and based on that risk, countermeasures and tests will be planned. While the FMEA is typically applied to hardware, many of its aspects can be applied to software as well:
- In a real-time missile targeting system, what is the probability of false positives or negatives in the object tracking algorithm? What are the ramifications in each case? What are some possible mitigating strategies?
- For an assembly line optical inspection system, what is the probability of errant color identification? What will happen if a discolored product reaches the customer? Does this need to be avoided? If so, how?
- In a GPS navigation system, what is the likelihood that a vehicle will be identified as traveling on a road that parallels its actual path? Does this present a danger to the driver? What is the likelihood that the driver will act on instructions computed using the wrong path? How can this be handled?
Again, the ultimate answer is that reliability requires a systemic approach, and can only be partially addressed in software. However, there are some mechanisms that can be applied throughout the software development process that will decrease the chances of failure in a critical system.
I understand that I'm preaching to the choir here, but these are concepts of which a thorough understanding would have been very helpful to me early in my career. I hope they can be of use to you and your readers.
Bill Gatliff sent this:
It's often said that the one thing that makes embedded systems truly different from other types of computers is that we have to bridge the cognitive gap between our programs and the real world. Data processing happens in computer time, but sensors and actuators always happen in natural time, and our code and circuits stitch the two together.
As such, it's patently stupid to simply average incoming sensor data in order to estimate its value. And it's full-on negligent to use that averaging algorithm to "get rid of incoming noise". Sure, it'll work in the nominal cases, but what will happen if the wire gets disconnected altogether? An averaged stream of -1's is still... -1.
The correct solution is so natural, we don't even think of it. Consider the oil pressure gauge in the dashboard of your car. If that needle starts jumping about madly, we humans naturally start ignoring it because we know it can no longer be trusted: oil pressure doesn't jump around all that fast. We'll still log the problem for investigation, but we won't immediately halt the trip to summer camp. Or set the engine on fire.
The same is true for your speedometer, but in that case the seat of your pants gives you additional information that you don't have with oil pressure, coolant temperature, and other largely invisible values that are of the greatest importance here. Embedded systems don't usually wear pants, after all (but their operators often do, thankfully).
In the above examples, we humans aren't simply reading the gauges: we're also comparing those values against a mental model of what we _should_ be seeing, comparing all the available information against other types of sources, and trusting the ones that seem more plausible in the moment.
In temperature sensing for embedded work, we usually know the maximum rate at which a process can change temperature. When our sensor returns an error code or number that's way out-of-bounds, then, we shouldn't simply throw up our hands: we should switch to that estimation, instead. If nothing else, that will give us time to ask the operator if he wants to continue running, or shut the system down gracefully.
And speed? Most ground vehicles can't change velocity at a rate that exceeds 9.8 m/s^2 unless they are mechanically coupled to the traction surface. There's the key parameter for your tracking algorithm right there.
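Bill's rate-of-change check can be sketched roughly like this; integer readings, a fixed sample period, and all names are assumptions made for illustration:

```c
#include <stdlib.h>

/* Sketch of a plausibility filter: reject any new reading that implies
 * a faster change than the process can physically produce, and hold the
 * last trusted value as the estimate while the fault is logged.
 * max_delta is the largest physically possible change per sample period
 * (e.g. derived from 9.8 m/s^2 for a ground vehicle's speed). */
typedef struct {
    int last_good;   /* last reading that passed the check */
    int max_delta;   /* physical limit on change per sample */
    int faults;      /* consecutive implausible samples seen */
} plaus_t;

int plaus_filter(plaus_t *p, int raw)
{
    if (abs(raw - p->last_good) <= p->max_delta) {
        p->last_good = raw;   /* plausible: accept the reading */
        p->faults = 0;
    } else {
        p->faults++;          /* implausible: keep the estimate */
    }
    return p->last_good;
}
```

A disconnected wire streaming -1s is rejected outright instead of being averaged in, and the `faults` counter gives the system a basis for asking the operator whether to continue or shut down gracefully.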
The problem with the above examples is that, to make them work, we have to extract more information from our incoming data: we need the absolute value, but we also need to pay attention to the trend in case the data goes away. Yes, that's more development work, but I'd argue that the system is recklessly incomplete without it.
Our systems have to survive in the real world. And the real world doesn't stop just because you've lost a sensor or two.
Luca Matteini wrote:
I always read with interest your newsletter and share your thoughts about design engineering.
I think that your example about a necessary design margin, with the defective plane gear, covers as well the question on how to code better.
After all, what saved the plane wasn't the gear, but its strut, plus the rest of the plane structure (and not to forget a very good pilot).
So the only answer I can think of is that, for a safer software application, we need to structure it in "safe blocks". The smaller you can make these "self-repairing blocks", the better you can isolate faults; here "smaller" inherently means a smaller impact on the system's stability.
Just as a mechanical (or electrical) system is designed to be redundant for safety, a software application should be designed as distinct pieces bound together, with functions and a level of separation adequate to the specific environment.
There are already some practical examples, often underestimated by developers.
If you code a graphics editor and it crashes, the user can lose her or his edited pictures (even that, for some, can be a great loss). Usually it's safer to keep backup copies before starting to edit, so the file system can be your first level of protection. That holds true even for a programmer editing code: saving data and having a backup is crucial.
At the same time, besides recovering from a crash (which can already be demanding), it's often important to restore a consistent state as soon as possible. A bricked motor control should at least /try/ to stop gracefully, instead of keeping everything running. A well-partitioned application could do that -- and when it's impossible to achieve, you usually think about catastrophic issues beforehand, in the hardware design.
When a redundant system isn't applicable (cost, complexity), an operating system can be designed and used as a controller as well. Running concurrently, you can design an application with main tasks and controller tasks, where the latter can shut down the former if they start running wild.
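Luca's controller-task idea might look roughly like this heartbeat-and-supervisor sketch; the task count, tick source, and deadline are all assumptions for illustration:

```c
/* Each main task periodically "kicks" its entry in a heartbeat table;
 * a controller task checks that every kick arrived within its deadline.
 * A task that has gone silent (or wild) is reported so the caller can
 * shut it down or restart it. */
#define NTASKS 3

static unsigned last_kick[NTASKS];   /* tick of each task's last heartbeat */

void task_kick(int id, unsigned now) { last_kick[id] = now; }

/* Returns the id of the first task that missed its deadline, or -1
 * if everyone checked in on time. */
int supervisor_check(unsigned now, unsigned deadline_ticks)
{
    for (int id = 0; id < NTASKS; id++)
        if (now - last_kick[id] > deadline_ticks)
            return id;   /* this task is in trouble */
    return -1;
}
```

On an RTOS, the supervisor would typically be a high-priority task that also services the hardware watchdog only while all heartbeats are healthy.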
High-level programming languages often offer some control over data and function separation: even that can be viewed as a method for partitioning a single application. Avoiding mixing different parts of an application helps both to maintain it better and to complete different tasks even when in trouble. I remember a customer application where the log file was left open all the time: when the application crashed with an error, you couldn't see the error in the log because it was lost before any write (...).
These were all thoughts on the subject that easily popped to mind, and I think there can be dozens of examples.
It's my opinion that a good solution is to plan some form of separation and safety *before* coding, and to always remember it *while* coding. And again: when debugging, any single anomaly, no matter how small, always has to be dissected and removed. Never forget that compiler warnings anticipate potential disasters.
Ian Stedman made a point I first disagreed with but have reconsidered:
The Space Shuttle has three computers in a little debating society, and they vote on what to do, and can vote to throw out a member who seems to be malfunctioning. The Shuttle also contains a fourth computer, specifically designed to have virtually nothing in common with the first three, that knows just enough to 'Get us home, please.'
Obviously, that was fearfully expensive, but the way silicon costs have been plunging: why not run four copies (or four different versions) of your software on four cores of a single MCU, and vote on the outputs?
(Yes, if all four cores have a common flaw, it won't help, but that's a hardware problem... ;-) There are lots of other complexities, of course (who votes for the voting system?), but hey.)
At huge expense the Shuttle had a completely independent set of software. Obviously, that's not an option for most embedded systems. However, given the increasing concerns with bit flips from cosmic rays, in some circumstances Ian's approach makes sense. Parts are already available, like TI's Hercules series which have twin ARM cores executing in lockstep while looking for discrepancies in behavior. Given that silicon costs tend towards zero (or a very low number) and that most embedded systems use MCUs that are not pushing the state of the art in fab technology, I expect we'll see more of this.
Per Söderstam, among other things, recommended Henry Petroski's book, which I heartily second:
Here's my thoughts on the subject of "margin" in software.
In the case of the airliner nose gear, the design is not simple but, I imagine, more straightforward. It is a well-defined, bounded part for which several well-known solutions exist in the literature. The problem becomes one of deciding which to choose, tweaking it, and fitting it to the available physical space while ensuring enough margin for it not to break, even when the wheel is turned sideways.
The mechanical engineering profession has had 200 years to perfect the art of figuring out what, where, and when a metal thing will break. Remember when Britain was overgrown with collapsing railway bridges, and tracks broke at a rate that caused quite a stir in the general populace; the very concept of train travel was, at some points, on the line. I recommend "To Engineer is Human -- The Role of Failure in Successful Design" by Henry Petroski on the subject of mechanical failures.
Mechanical and many other engineering disciplines also have the added advantage that what they work with is physical. Anyone, even the stakeholders, can see the difficulties of building a bridge, house, car, whatever. It is possible, though not necessarily easy, to point out the current problem and valuate proposed solutions.
Software, on the other hand, is Magic. It is invisible to the public and stakeholders. Sorry to say, even many in our own profession have difficulty comprehending the complexity inherent in a system implemented in software. This is unfortunately all too often the case as you climb the hierarchies of management, where interactions with customers produce change and where time and resources are allotted. Imagine if there were a mapping between a piece of software and a mechanical equivalent that could expose the complexity of a software-based system! One you could show to a change-request-carrying drive-by manager/customer/stakeholder, or, for that matter, to a "hey, my solution works, so why not" co-worker. I'll bet that would cool things down to a level where margin can be properly implemented.
Many of the tools and the knowledge to build margin into software-based systems exist and are available today. Theories on fault-tolerant HW/SW systems are there and are being refined, as were the theories on railway bridge-building in the 19th century. Tools abound and standards are being written at an alarming rate.
The main problem is to make the magic visible: to ourselves, so we can see where margin must go, and to everyone else, so they can provide the resources and work environment to make it happen. This will happen as we ourselves start embracing the notion of margin, the universities continue their thinking, and the stakeholders mature in their vision of what is possible at what price.
Let's hope for, and work toward, shortcutting this process, given what we know the mechanical and building communities have painstakingly learned over 200 years of failure.
Steve Paik wrote:
You hit the nail on the head with the analogy of 90% being an A. I have always said that SW engineering is challenging precisely because one wrong bit out of 1 million bits (for 128K) could throw your whole program off. 90% correct in the embedded world essentially means you have nothing.
This goes to the age old question of how do you engineer QUALITY and ROBUSTNESS into your code base? Both of these things are really hard to measure empirically (Zen and the Art of Motorcycle Maintenance is a must-read discussion on quality) so these attributes get lost. Yet, customers do have a sense of them and know quality when they see it.
In terms of adding margin, I don't know that you really can, aside from over-speccing the processor speed and the amount of memory available. Of course, the business side wants you to cut as much as possible to save cost, so you can only spec so much flash, RAM, etc.
To that end, I find that solid fundamental processes, like the ones you teach in your courses, are the best way to build quality and robustness into the product. In other words, good upfront requirements analysis, clean architecture and design, low cyclomatic complexity in the codebase, etc. After that, it becomes a matter of how quickly I can expose bugs and fix them:
1) If the code detects an error, fail HARD. This means sprinkle ASSERT() everywhere and leave them in the production code. Why? It's easy to check for a null pointer and just return, hoping the layer above you deals with it. However, if the layer above you doesn't deal with it properly, you're just passing the buck and eventually the program will fail somewhere else and be a mess to debug. By asserting immediately, you kill the system and create pain for users. Pain results in highly visible bugs that get fixed right away.
2) Make it as simple as possible for users to report bugs. My asserts report filename + line number. If the user needs to get a whole stack trace, or decode a blinking LED, the chances that they'll get you a bug report are slimmer, unless it's really frequent.
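A minimal version of the ASSERT() Steve describes, reporting file and line and failing hard, might look like this. The `assert_fail` hook is a sketch; in a real system it would log to nonvolatile storage and move to a fail-safe state before halting:

```c
#include <stdio.h>
#include <stdlib.h>

/* On failure, report exactly where the assertion fired, then fail
 * HARD so the bug is maximally visible instead of limping on. */
static void assert_fail(const char *file, int line)
{
    fprintf(stderr, "ASSERT failed at %s:%d\n", file, line);
    abort();   /* kill the system; pain makes bugs get fixed */
}

/* Left enabled in production code, per Steve's point 1. */
#define ASSERT(cond) \
    do { if (!(cond)) assert_fail(__FILE__, __LINE__); } while (0)
```

The do/while(0) wrapper makes the macro behave like a single statement, so `if (x) ASSERT(p != NULL); else ...` parses correctly.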
3) Also make it easy to get a build number, SVN number, or some other way to identify the build. It's essential to help track where/when bugs are introduced, and whether they're already fixed.
Good design, coupled with good testing and ease of bug reporting, goes a long way toward building up the quality and robustness of the product. If the HW is failing, there's not much the SW can do unless you specifically design for it. Perhaps in applications like space systems, people do design the SW to deal with memory errors and such, but most everything I have seen is a watchdog timer or an assert that simply resets the processor and prays it doesn't happen again!
Exception / error handling needs to be part of the requirements, otherwise it will not be addressed properly. For instance, if I'm making an ECU for a car, it may be totally unacceptable for the ECU to reboot while the engine is running. In this case, if a sensor is faulty, we need to decide whether the system can run at a reduced efficiency and design it to handle the case appropriately.
The assert() is a great placeholder for "todo or future work", but it isn't the end solution all the time!
Tom Archer wrote:
What you wrote is true...most of the time. In general, mechanical and structural parts have the luxury of being "sloppy." There's probably a better word, but "sloppy" will do. You ended with the question, "How can we add design margin to code?" My short answer: IF ALLOWED, anticipate failure modes' consequences. It is important, though, to recognize that while we all seem to end up dealing with consuming foreseeable messes, uncountable people and their managers make good, conservative, thoughtful decisions every day that avoid ugly consequences.
There are, however, mechanical failures in huge numbers: some in a sense deliberate, some design errors, and many manufacturing errors. Parts on your car, for example, wear out (fail) at some point by "design."
Back to your mechanical engineering thoughts: A good, formal example is the design requirements in the ASME Boiler and Pressure Vessel Code. We know a lot about the end use loads on boilers and pressure vessels and they're usually designed based upon "requirement" loading that exceeds their expected use and are equipped with safety devices to limit actual loads to slightly above expected loads. Additionally, the "Code" generally mandates a safety factor of 4 based upon minimum specified yield strength of the construction material. Actual material is typically 25% or more stronger, but actual is irrelevant for design purposes. So, with "compound" safety factors, the actual safety factor can easily be "10" or more and fatigue is ignored (there are provisions for corrosion). In the end, it's an economic decision; pedestrian materials (steel) are cheap relative to everything else in the chain.
Aircraft are a bit different and your example of Flight 292 is, while odd to look at, probably not exceptional. The actual typical loads on landing gear are amazing; airplanes are heavy. (I've been on an A320 when an engine exploded!) Note that the skilled pilot brought that nose down very, very slowly. The actual loads on that gear were probably no worse than normal. You wrote "Something went wrong with the system used by the air industry to eliminate known defects." Not so; there were prior examples of the failure and the causes were known, but the probabilities of failure were low and neither loss of life nor loss of aircraft were expected consequences; the economics were such that there was no emergency, so there was no urgent fix.
But your point is valid and Air France Flight 447 is probably a better example. The probable initiating cause of the crash, a frozen pitot tube leading to faulty airspeed indications, was known but the cascading effect to failure was not anticipated. The San Bruno California pipeline blast in 2010 is another example, but that failure was anticipated by the technical professionals and dismissed for the short term by management.
But because materials make such a functional and economic difference over the life of an aircraft, a lot of work goes into understanding and managing the loads and life of aircraft. Typical design safety factors can range from 0.8 to 1.2 relative to max loads. Because the margins are thin, triple redundancy and real-time monitoring are employed. It used to be that the monitoring focused on moving parts and power systems, but as the technology has evolved that's been extended to critical structural components as well, especially in military aircraft.
As more modern composites and hybrid materials are created and used, you can expect to see more failures. As a new generation of "computer savvy" designers emerges, you can expect to see more failures. Corrosion, for example, is a ubiquitous phenomenon that creates unpredictable failures. For about 50 years, designers have recognized the problem and mitigated it in so many ways that the problem rarely manifests itself, so it's been forgotten as a problem. New specifications no longer include requirements for corrosion protection, so we'll see a repeat of the old cycle. Polymers (and composites) may not "rust" in the traditional sense, but they change over time, sometimes in unexpected ways.
Software isn't so very different in most cases. If allowed, and the economic and performance price is acceptable, software designers will include redundancies, checks and corrective actions as appropriate. A robot will shut down if an encoder signal is lost for example, though there may be significant debate about what actually will or should happen when that signal is lost. There's near universal understanding that somehow, someway, control signals will be lost or corrupted, so the system has to deal with that near certainty.
Where software and hardware fall on the same sword is un-anticipated or more often perhaps, the denied failure modes and consequences. Arguably, our memorable "engineering" disasters, such as Challenger, Columbia and Deepwater Horizon are cases of "denial" rather than surprise. We make a formal, technical distinction between risk and uncertainty. "Risk" includes a description, a known probability of occurrence and known consequences. "Uncertainty" lacks a probability distribution, and therefore is often "assigned" a notional (or perhaps "emotional") probability some epsilon greater than zero, low enough to justify ignoring the consequences.
Again, on the more positive side though, every day there are the uncountable people in all vocations, software included of course, who do make good decisions, who anticipate failure modes and consequences, who refuse to take the shortcuts. Sometimes it gets ugly, people get fired or demoted.
Ray Keefe always has valuable input. He wrote:
I loved the section "On Margin". So painfully true. As SW developers we know that 1 typo can lead to a devastating failure of a system. As you said, 1 bit wrong can lead to disaster. No other discipline has this degree of fragility or dependence.
When we talk about fully tested, it gets to be quite difficult to be confident in. I can write SW with 100% code coverage and unit tests for all the modules, but I can't as easily test every aspect of the tool chain or the silicon itself.
And that is before a cosmic ray flips 1 bit in memory.
I also find that embedded SW developers who have studied hardware first approach test differently. As an electrical engineer who now mostly writes SW, I started life using multimeters, LSAs, and CROs (now I'd use a DSO) to make sure the hardware was doing what I needed and within margin. I use a vernier caliper and micrometer to make sure mechanical parts are made to drawing. I build and submerge enclosures that are meant to be IP67 rated. Doing the same with SW is a good thing. It is a mindset before it is a practice.
I do think we can add margin to code. And on multiple axes. And I agree it is hard to get to 100%. Here is a short list of obvious and not-so-obvious ways to start doing this:
1. Use known-good development practices, design patterns, coding standards
2. Reusable code comes from design for reusability. So design with that in mind. Reusing fully tested code saves time, money, smoke, hair, everything
3. Use more inclusive tests, e.g. >= rather than just == for increment-to-threshold checks
4. Refresh hardware initialisation and core data structures for mission critical code
5. Use watchdogs
6. Always, always, always check pointers. No NULL dereferencing.
7. Design with object oriented principles in mind. You can do this in C. This design philosophy helps design more easily testable code and code that is likely to be fully testable. But you really have to embrace checking every pointer if you do this.
8. Measure your quality. Use static analysis, cyclomatic complexity, stack checks, look at the linker output map, do code reviews, design reviews, code walkthroughs, FMEAs (Failure Mode and Effects Analysis)
9. Write error handlers that can exit gracefully and provide a reliable and safe system restart strategy. The HW designers need to be in on this one. Pull ups and pull downs on ports etc.
10. Do the HW and SW design as a system
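As a small illustration of items 3 and 6 on Ray's list, here's a sketch of an increment-to-threshold check that uses >= and never dereferences a suspect pointer; the function name and threshold are invented for the example:

```c
#include <stddef.h>
#include <stdbool.h>

#define LIMIT 100   /* illustrative threshold */

/* Item 3: compare with >= rather than ==, so a corrupted or skipped
 * increment (a bit flip, a missed tick) that overshoots the threshold
 * still trips the check instead of sailing past it forever.
 * Item 6: check the pointer before every dereference. */
bool threshold_reached(const unsigned *count)
{
    if (count == NULL)        /* never dereference NULL */
        return true;          /* treat a bad pointer as the safe case */
    return *count >= LIMIT;   /* >= survives an overshoot */
}
```

With `==`, a counter that jumped from 99 to 101 would never match the threshold; with `>=` the condition still fires on the first sample past it.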
FMEA is a routine practice in mechanical design teams and I also use this with electrical/electronic and SW teams. The basics are 3 axes covering:
- How likely is it that this fault will occur
- How bad will it be if this occurs
- How easily can a user or even an expert tell if it happens
Safety critical design teams add a fourth axis: what is our history of having a problem in this area!
Give them a score out of 10 for each (I can provide an example of a rule set if there is interest). Multiply the scores together. If you score more than a certain threshold, e.g. 125 for the 3-axis version, redesign is recommended. More than a higher threshold, say 180, or 8 or more on any individual axis, and redesign is mandatory. We use this concept when we review code, and in particular error handling and recovery.
On one project, we had a score of 360 for one failure scenario. This meant almost certain to occur within the warranty period, possible damage to the unit, and no one could tell it had happened. After some review, we deleted the feature, as no one could find a way to mitigate the scenario and it was a 'nice to have'.
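Ray's scoring scheme can be captured in a few lines. The thresholds below are the ones he quotes (more than 125 means redesign recommended; more than 180, or 8+ on any axis, means redesign mandatory); everything else is an illustrative sketch:

```c
typedef enum {
    RISK_OK,
    RISK_REDESIGN_RECOMMENDED,
    RISK_REDESIGN_MANDATORY
} risk_t;

/* Each axis is scored 1..10; the product is the overall risk
 * (a risk priority number, in classic FMEA terms). */
risk_t fmea_risk(int likelihood, int severity, int detectability)
{
    int rpn = likelihood * severity * detectability;

    if (rpn > 180 ||
        likelihood >= 8 || severity >= 8 || detectability >= 8)
        return RISK_REDESIGN_MANDATORY;
    if (rpn > 125)
        return RISK_REDESIGN_RECOMMENDED;
    return RISK_OK;
}
```

A scenario scoring 360, like the one Ray describes, lands squarely in the mandatory band, which is consistent with his team's decision to delete the feature.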
So yes we can add margin by design and thinking about the system, not just the code. But I am also resigned to the knowledge, for now, that getting to 100% probably can't happen most of the time.
And I also enjoyed the telecoms equipment story. Yet another example of an unexpected dependency at the system level.