The Embedded Muse
Issue Number 270, October 6, 2014
Copyright 2014 The Ganssle Group

Editor: Jack Ganssle, jack@ganssle.com

You may redistribute this newsletter for noncommercial purposes. For commercial use contact jack@ganssle.com.

Contents
Editor's Notes
Quotes and Thoughts
Tools and Tips
More Responses to Margin in Software
== Considered Harmful?
Sequence Points
Jobs!
Joke For The Week

Editor's Notes


You still have a couple of days to get a $50 discount for early signup.

Did you know it is possible to create accurate schedules? Or that most projects consume 50% of the development time in debug and test, and that it’s not hard to slash that number drastically? Or that we know how to manage the quantitative relationship between complexity and bugs? Learn this and far more at my Better Firmware Faster class, presented at your facility. See https://www.ganssle.com/onsite.htm.

Embedded Video

The videos are on hiatus for a couple of weeks due to an excess of other activities here.

Quotes and Thoughts

"The trouble with programmers is that you can never tell what a programmer is doing until it's too late." - Seymour Cray

Tools and Tips

Please submit clever ideas or thoughts about tools, techniques and resources you love or hate. Here are the tool reviews submitted in the past.

At ARM TechCon last week Micrium demonstrated the latest version of their uC/Probe that monitors firmware in real time. Wow! I've used it in the past with great success, but the new version offers many new features and a stunning UI.

More Responses to Margin in Software

In response to some of the reader comments in Muse 269, Bill Gatliff wrote:

Disappointing 'Muse this month!  Not your prose, but some of the responses to your question about design margin are truly shocking to me.  Sorry for the long reply, but this is something I'm really passionate about and I just can't let it go yet.

There is no notion of "design margin" in software!  The concept can't be applied that way, because software is purely a solution description language and is NOT the solution itself.  Surely I'm not the only one who understands this.

In mechanical engineering, you have a blueprint that describes the strut (to continue your example), and then you have the tangible, fabricated strut itself.  The blueprint can specify a design that will exhibit margin, but only the strut can demonstrate that margin when put to test.  You can change the colors of the lines in the blueprint, render to heavier paper, and even convert from metric to english units, but if you make no other changes then the resulting strut will still behave in exactly the same way.

Software is merely the blueprint, it isn't the strut itself.  We know this is true, because an almost infinite set of software possibilities can describe the same solution, just like different blueprints can describe the same strut.  And in the same way a machine shop crafts a strut per the blueprint, a CPU "fabricates" a behavior according to the code each time you power up the system.

Software that's designed to "fail HARD" at the first sign of error has absolutely NO margin whatsoever!  It's equivalent to a strut that's designed to survive only perfect landings.  That's not just bad, it's a system that's designed specifically to BE bad.  No, thanks.  As in, "I had to cut out the profanity I'd prefer to use here to get my point across."

A better approach is to define algorithms that fail gracefully under overload, and use models of system data to ride through brief periods of sensor noise.  Even better is something like a Kalman filter that can synthesize a missing measurement from other data.  Think in terms of degradation, rather than failure, and design for that.

When an engine approaches a temperature limit, for example, the allowable heat-generating horsepower output can be reduced as a function of how close to the limit we are.  Not only does this give the operator a sense that failure is approaching so he can limp to the side of the road, it also makes the machine more self-protecting: abruptly shutting down a hot engine is pretty devastating to turbochargers, as well as to the drivers behind you.

What you end up with, if you implement the above, is a ratiometric relationship between temperature and torque rather than a long list of if-then statements. The latter is very difficult to test by comparison. It's also a lot more code.
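
In code, that derating might look something like the sketch below; every constant and name here (TEMP_DERATE_START, TORQUE_LIMP, and so on) is invented purely for illustration, not taken from any real engine controller:

/* Ratiometric derating: allowed torque falls linearly as temperature
   approaches the limit, instead of an if-then cliff at TEMP_MAX.      */
#define TEMP_DERATE_START   95.0f   /* deg C where derating begins          */
#define TEMP_MAX           110.0f   /* deg C where output reaches the floor */
#define TORQUE_FULL        100.0f   /* percent of rated torque              */
#define TORQUE_LIMP         10.0f   /* just enough to limp off the road     */

static float allowed_torque(float temp_c)
{
    if (temp_c <= TEMP_DERATE_START)
        return TORQUE_FULL;
    if (temp_c >= TEMP_MAX)
        return TORQUE_LIMP;

    float frac = (TEMP_MAX - temp_c) / (TEMP_MAX - TEMP_DERATE_START);
    return TORQUE_LIMP + frac * (TORQUE_FULL - TORQUE_LIMP);
}

There are only two boundary conditions to test, with one smooth line between them; the if-then ladder multiplies both the test cases and the code.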

We can get a lot of fault-inducing stupidity out of today's systems, too.  Never store redundant data, for example, unless it's for the purpose of data recovery.  Otherwise, you risk asking the system what its favorite color is, and getting back an answer like "blue!... no, red!!"  Redundant algorithm implementations, like checksum calculators, are also evil.  And persistent flags that get set based on repeatable calculations of non-changing data are in a category of evil all to themselves.

If the terms "state scrubbing" and "run to completion" don't mean anything to you, then please fix that.  And don't write any more code until you do.

Finally, with just a little feedback we can make software systems that are remarkably tolerant of mechanical failure.  A driver instinctively raises his foot when he detects he's not getting typical acceleration, in order to stop wheel spin.  See?  Even everyday commutes involve a mental transfer function between foot position, the loudness of the engine, the pressure against the driver's backside, and the rotational speed of the wheels!  A computer can do such things even better than humans can, if we write the code to do so instead of just throwing assert() statements into code that's inadequate in the first place.
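
That reflex translates to surprisingly little code. A crude sketch, with the slip threshold, gain, and sensor helpers all hypothetical:

#define SLIP_THRESHOLD  0.10f    /* 10% wheel slip before we intervene */
#define BACKOFF_GAIN    2.0f     /* how hard we "lift our foot"        */

extern float wheel_speed(void);      /* hypothetical sensor reads        */
extern float vehicle_speed(void);
extern float driver_throttle(void);  /* 0.0 .. 1.0 pedal position        */

/* Ease off the commanded throttle in proportion to excess wheel slip,
   the same thing a driver does by feel when the wheels break loose.   */
float throttle_command(void)
{
    float v    = vehicle_speed();
    float slip = (wheel_speed() - v) / (v > 1.0f ? v : 1.0f);
    float cmd  = driver_throttle();

    if (slip > SLIP_THRESHOLD)
        cmd -= BACKOFF_GAIN * (slip - SLIP_THRESHOLD);

    return (cmd < 0.0f) ? 0.0f : cmd;
}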

I can even fly my R/C helicopter when a gear suddenly breaks on one of the flight surface control servos.  When I notice that the machine is limiting its response to my stick inputs, I ease up on other inputs in order to keep the related angles between flight surfaces inside the stable ranges.  I can't fly fast anymore, but I might just land with minimal additional damage.  In contrast, if the software onboard the helicopter detects the error and just halts, my bird immediately becomes an expensive projectile EVEN though most of the hardware isn't damaged.

In the big picture, I'm increasingly convinced that there's no such thing as "error" in embedded work because our machines don't really have the option to stop working---real life goes on, after all.  I'm also pretty certain that if-statements represent abrupt changes in system descriptions, and strongly suggest that you've band-aided around a shortcoming in your solution rather than fixed it.

Why do we spend so much time obsessing over ways to avoid fixing the shortcomings of the solutions we describe in software?  Let's just describe better solutions, and move on.

In a follow-up exchange about null pointers and malloc failures, Bill replied:

As to null pointers, there are three root causes that come to mind: you didn't provide some necessary input data, you ran out of memory elsewhere (see below), or you've got a calculation fault that produced a bogus result.

For all of these, the key to survivability is to design a solution that doesn't demand that data... or can at least ride without it for a while.  This usually means a strategy that looks more like closed-loop control than simple, procedural, action/reaction code.  (Those autonomous loops make it much easier to do the director-actor thing you mentioned, too.)

Pointers always refer to dynamic data, otherwise you'd just refer to the symbol by name.  As such, it's data that is defined to come and go---and so you always need to deal with both cases!  If the data doesn't appear when expected, then either (a) there's nothing to do in response, because there's nothing to recalculate; (b) we need to re-send the request, and we'll use the previous result or an estimation in the meantime; and/or (c) the situation has persisted for so long that we need to initiate an affirmative reaction.  The details of the situation determine which of those three choices is most appropriate, but if you find that you can't make any of them then your solution (not just your description thereof, a.k.a. the code) needs some serious rework.
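
One way to express those three cases is a simple staleness policy, sketched below; the names, timeouts, and helper functions (request_measurement(), enter_degraded_mode()) are all invented for illustration:

#include <stdint.h>

extern void request_measurement(void);   /* hypothetical: ask the sensor again   */
extern void enter_degraded_mode(void);   /* hypothetical: the affirmative action */

typedef struct {
    float    last_value;      /* most recent good measurement */
    uint32_t last_update_ms;  /* when it arrived              */
} sensor_t;

#define STALE_MS    100u   /* past this, case (b): re-request, coast on old data */
#define GIVE_UP_MS 1000u   /* past this, case (c): take an affirmative action    */

float sensor_value(sensor_t *s, uint32_t now_ms)
{
    uint32_t age = now_ms - s->last_update_ms;

    if (age < STALE_MS)
        return s->last_value;        /* (a) data is fresh; nothing special to do */

    if (age < GIVE_UP_MS) {
        request_measurement();       /* (b) ask again...                         */
        return s->last_value;        /* ...and ride on the previous value        */
    }

    enter_degraded_mode();           /* (c) it's been too long: act              */
    return s->last_value;
}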

As for malloc failures (or running out of memory in general), you use dynamic memory allocation when you don't know how much stuff you'll have to deal with in advance.  Among other things, this gives the system "elasticity" if you feed it an unexpectedly big blob of data, or a burst of network traffic comes in, or something like that.

But a system environment that begs for dynamic memory allocation is also one where overrun is a de facto possibility---if you knew the upper limit, you would have just planned for it!  When overrun happens, your system MUST have a strategy for dealing with it.  Otherwise, your work isn't finished.

Some systems can just throttle incoming data until they've worked down the stuff they already have.  Others can drop the old stuff on the floor, or the new stuff.  If none of these are an option, then you've got to refactor your solution in some way to fix that.  Just blowing up when reality doesn't limit itself to your system throughput limitations is reckless and unprofessional.  It's also increasingly not an option--- although really, it never was.
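
The "drop the old stuff" option, for instance, is often just a ring buffer that overwrites its oldest entry on overrun. A minimal single-threaded sketch (sizes and names are assumptions; a real driver also needs interrupt protection):

#include <stdint.h>

#define RING_SIZE 64u                 /* arbitrary size for illustration */

typedef struct {
    uint8_t  buf[RING_SIZE];
    uint32_t head;                    /* free-running write counter */
    uint32_t tail;                    /* free-running read counter  */
} ring_t;

/* On overrun, advance the tail: the newest data wins, the oldest is dropped. */
static void ring_put(ring_t *r, uint8_t byte)
{
    r->buf[r->head % RING_SIZE] = byte;
    r->head++;
    if (r->head - r->tail > RING_SIZE)
        r->tail = r->head - RING_SIZE;      /* drop the oldest byte */
}

static int ring_get(ring_t *r, uint8_t *byte)
{
    if (r->head == r->tail)
        return 0;                           /* empty */
    *byte = r->buf[r->tail % RING_SIZE];
    r->tail++;
    return 1;
}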

Do you think that Apollo 11 would have survived a computer that just threw its hands up at the first sign of trouble?  And I swear, if I have to replace another battery in my TV remote just because one of the buttons got pressed when the thing disappeared between the cushions, ... or I have to throw the remote away entirely just because one button got stuck...

What I'm getting at is this: null pointers and malloc failures are just what code does sometimes.  If your overall solution can't deal with them, then your solution is impossibly fragile and incomplete.  For every development problem I've encountered, there was always a way to describe a resilient solution---though finding it often took a while.  (Aside, the code usually gets easier the longer you look at the problem.)

If ordinary code-related faults are the bane of your existence, then it's likely that you've approached the solution in a fragile, short-sighted way.  Throwing in assert statements and error codes everywhere won't fix a bad solution, it'll just fix the description for that bad solution.

A bad solution is still bad, no matter what the code looks like.

Nick P wrote:

The way I did it was to specify what each function could or couldn't do. A checker runs side-by-side with the main software, watching its actions and state transitions. Each component has fail-safe, recovery, or other strategies that are suited to that component. The checker system can inspect the internal state if an error happens, typically to log it. It can also put it in a good state per the failure/recovery policy for the system's components. Of course, you have to design your whole stack to do this and you will have to port the lower layers of it to each new hardware platform. Lockstep execution units help, too.
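
A toy version of the transition-checking piece of that might look like the sketch below; the states, the table, and log_violation() are invented here for illustration, and a real system is obviously far more involved:

#include <stdbool.h>

/* Hypothetical application states and transition policy. */
typedef enum { ST_IDLE, ST_RUN, ST_BRAKE, ST_FAULT, ST_COUNT } state_t;

static const bool allowed[ST_COUNT][ST_COUNT] = {
    /*            IDLE   RUN    BRAKE  FAULT */
    /* IDLE  */ { true,  true,  false, true  },
    /* RUN   */ { false, true,  true,  true  },
    /* BRAKE */ { true,  false, true,  true  },
    /* FAULT */ { true,  false, false, true  },
};

extern void log_violation(state_t from, state_t to);   /* hypothetical logger */

/* The checker vets every transition the main code attempts; on a violation
   it logs the internal state and forces the component's recovery state.    */
static state_t checked_transition(state_t from, state_t to)
{
    if (allowed[from][to])
        return to;
    log_violation(from, to);
    return ST_FAULT;                  /* per-component recovery policy */
}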

That said, I kind of question the need for margin in our business if it's a custom system (eg in embedded). Aside from the main CPU & memory, software can be programmed to detect failures in any critical or non-critical component to take appropriate action. Redundant components can be added to reduce effect of that aspect. Lock-step can be added to reduce effect of CPU & memory among others. Far as software itself, EAL6-7 class development processes have built systems with near zero defect levels and provable security/safety properties. The systems I've seen that use that kind of process alone showed no safety or security failures in the field. Tools like SPARK Ada and Astree C combined with processes like Fagan or Cleanroom provide a lower cost version of that. Likewise, NonStop-style systems just keep on going without total system failure despite individual units and apps being susceptible to all kinds of problems.

So, I just don't think it's as big a problem as many people think. The real problem is that *the vast majority of systems, embedded or not, run complex software created using provably unsafe development processes/tools on unsafe architectures often containing many significant points of failure.* Then, their stuff fails. Surprise! It's really market choices more than anything. That's true everywhere from embedded to servers.

The ultimate solution to reliable (and secure) systems is a [safer] language supporting contracts/invariants (eg Ada 2012) + a rigorously proven runtime compiled with a certifying compiler onto a number of execution units running in lockstep, each of which is an inherently safer processor. Examples of such execution units include crash-safe.org (SAFE processor) and jopdesign.com (JOP processor). Seeing as systems from the 60's-80's did some stuff like this (with success), I'd say the general concept is beyond understood: it's been done many times over in the field, just not applied by the majority at present. Yet, the market continues in the failure-prone directions...

(sighs)

== Considered Harmful?

In his comments in the last Muse, Ray Keefe suggested preferring ">=" (or, of course, "<=") to "==" in making decisions. His thinking is that adding the greater/lesser operator makes the code more robust in case the variable never exactly hits the equality value. It's an interesting idea that assumes we make mistakes.

But should we take it a step further?

If our design tells us "==" will work, then a situation where the greater-than construct takes action means there's something we missed. Perhaps a better approach is to, first, add a comment so future maintainers understand what we're doing, and, second, instrument decisions where we've used the broader ">=" construct. An example:

if(i >= SOME_VALUE)         // Note: >= used just as insurance
  {
  assert(i == SOME_VALUE);  // If the assert fires there's a design error
  // ... more code ...
  }

(To be clear, this makes sense only in the case where i is incrementing in some manner).

As I have written before, the default implementation of assert() consumes a lot of memory, since it typically drags in printf() and a pile of string literals. It's trivial to write a more firmware-friendly version that uses few resources, and I have been astonished at how little code the test itself generates on modern CPUs, even when the argument looks complicated.
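
For reference, here's the skeleton of that sort of lean assert; the handler is only a placeholder, and what it should actually do (log to noinit RAM, trip a breakpoint, restart) depends on the system:

/* A minimal firmware-friendly assert.  Pass only __LINE__ (or a module ID)
   if the __FILE__ string literals cost too much flash.                    */
void assert_failed(const char *file, int line);

#ifdef NDEBUG
#define ASSERT(e)  ((void)0)
#else
#define ASSERT(e)  ((e) ? (void)0 : assert_failed(__FILE__, __LINE__))
#endif

void assert_failed(const char *file, int line)
{
    (void)file;         /* placeholder: stash these somewhere useful */
    (void)line;
    for (;;)
        ;               /* halt; let the watchdog or a debugger take over */
}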

What do you think?

Sequence Points

Chip Overclock's most recent blog posting has a fun snippet of code. Do you know what this will do?

#include <stdio.h>
void main() {
    int x;
    int y;
    y = (x = 3) + (x = 4);
    printf("%d %d\n", x, y);
}

For the answer go here.

Jobs!

Let me know if you’re hiring embedded engineers. No recruiters please, and I reserve the right to edit ads to fit the format and intents of this newsletter. Please keep it to 100 words.

Joke For The Week

Note: These jokes are archived at www.ganssle.com/jokes.htm.

Jim Hanley was told a story rather like the one in Muse 268:

This reminds me of a story I heard years ago from my father who worked for NYT (New York Tel).  They supplied the audio feed to WKBW AM, a high-powered radio station whose antennas were located in the town of Hamburg, south of Buffalo.  A woman complained that when she flushed her toilet, she could hear the radio station faintly in the bathroom for a while.  She lived on Big Tree Road, right across the street from the antenna farm.

After initially dismissing the reports, they finally investigated and found that the soil pipe formed a nice antenna/ground combination, and corrosion had apparently formed what amounted to a diode detector.  All that was needed was a good flush to complete the circuit.  The toilet bowl and tank served as the "loudspeaker" if you will.

Advertise With Us

Advertise in The Embedded Muse! Over 23,000 embedded developers get this twice-monthly publication.

About The Embedded Muse

The Embedded Muse is Jack Ganssle's newsletter. Send complaints, comments, and contributions to me at jack@ganssle.com.

The Embedded Muse is supported by The Ganssle Group, whose mission is to help embedded folks get better products to market faster.