The Embedded Muse 357

Issue Number 357, September 3, 2018
Copyright 2018 The Ganssle Group

Editor: Jack Ganssle, jack@ganssle.com

You may redistribute this newsletter for non-commercial purposes. For commercial use contact jack@ganssle.com. To subscribe or unsubscribe go here or drop Jack an email.

Contents

Editor's Notes
Quotes and Thoughts
Tools and Tips
Freebies and Discounts
On Test - A Story
On Hardware in Asynchronous Sampling
More on Increasing the Resolution of an ADC
More On Whether 'Tis Nobler to Initialize or Not
This Week's Cool Product
Jobs!
Joke for the Week
Advertise with us
About The Embedded Muse

Editor's Notes

After over 40 years in this field I've learned that "shortcuts make for long delays" (an aphorism attributed to J.R.R. Tolkien). The data is stark: doing software right means fewer bugs and earlier deliveries. Adopt best practices and your code will be better and cheaper. This is the entire thesis of the quality movement, which revolutionized manufacturing but has somehow largely missed software engineering. Studies have even shown that safety-critical code need be no more expensive than the usual stuff if the right processes are followed.

This is what my one-day Better Firmware Faster seminar is all about: giving your team the tools they need to operate at a measurably world-class level, producing code with far fewer bugs in less time. It's fast-paced, fun, and uniquely covers the issues faced by embedded developers.

Public Seminars: I'll be presenting a public version of my Better Firmware Faster seminar outside of Boston on October 22, and in Seattle on October 29. There's more info here. Or email me personally.

On-site Seminars: Have a dozen or more engineers? Bring this seminar to your facility. More info here.

The last issue sparked a lot of reader ideas. Keep 'em coming!

Latest blog: On Evil.

Announcement: 16th Annual Embedded Systems Workshop 2018 in South East Michigan

IEEE South East Michigan Computer Society Chapter is conducting its 16th Annual Embedded Systems Workshop 2018 on Saturday, October 20th, 2018. This free workshop is open to all. Students (and teachers!) and practicing engineers are encouraged to attend.

The event will be held this year at Lawrence Technological University, 21000 West 10 Mile Road, Southfield, MI, 48075.

The aim of this event is to disseminate knowledge, directly benefiting the attendees while improving the technology skills pool and indirectly boosting the Michigan economy. Speakers and experts from the embedded systems industry will be making presentations, and will also be available for discussions and networking throughout the day. In addition to the technical presentations, there will be industry information displays and professional recruitment tables. Use this opportunity for networking with other engineers, industry experts and embedded enthusiasts.

Those interested in attending should register by October 18 to assure that sufficient food is ordered to accommodate their dietary preferences and requirements: http://bit.ly/esw2018

Quotes and Thoughts

"While technology can change quickly, getting your people to change takes a great deal longer. That is why the people-intensive job of developing software has had essentially the same problems for over 40 years. It is also why, unless you do something, the situation won't improve by itself. In fact, current trends suggest that your future products will use more software and be more complex than those of today. This means that more of your people will work on software and that their work will be harder to track and more difficult to manage. Unless you make some changes in the way your software work is done, your current problems will likely get much worse." - Watts Humphrey

Tools and Tips

Please submit clever ideas or thoughts about tools, techniques and resources you love or hate. Here are the tool reviews submitted in the past.

Freebies and Discounts

This month we're giving away a 30 V 10 A power supply.

The contest closes at the end of September, 2018.

Enter via this link.

On Test - A Story

Daniel McBrearty wrote:

I could call this "the reason I NEVER trust software", or "the more catastrophic and irregular a bug is, the less likely it is to be reported". (Perhaps I could call that McBrearty's Law).

Here's the story, suitably anonymised to protect the innocent (and me as well).

Some years ago I worked for a well-known international vendor of software, on one of their flagship products, which was advertised as a very high reliability product: 99 and lots more 9s, that kind of thing. The product was fairly mature, having been in the market for at least a few years, with an international team of engineers taking care of it.

We had a bug report come in from a distributor that was a bit hard to believe. One of the Linux boxes that formed the system was reported to come to a sudden halt, leaving only one process running. Pretty bizarre stuff, and it got handed to me to have a look at. At least one senior person on our team thought the distributor had lost a few nuts and bolts, or had some other form of hardware problem.

However, when I took a look I couldn't dismiss this in that way, bizarre as it was. For one thing, the engineer at the distributor had done a great job of writing some log scripts which gave all kinds of forensic info, and they had, with a lot of patience, managed to reproduce it just once, by loading the system heavily.

So I spent some time trying to analyse the logs in different ways, as well as talking to our software designers about possible causes. Everyone was a bit baffled though.

Well, to cut the story short - after a week or two, they reproduced it again, and again with logs. And the two sets of logs were just enough to see the "smoking gun".

It turned out that there was a core process which, under certain conditions, forked to create a copy of itself, and then killed itself using a "kill PID" system call.

However, under some circumstances, and due to one missing line of code (a return statement, I believe), it was possible for the system to use "-1" in place of the PID. A quick look at the man page tells us what that does (I just tried it on a VM to prove to myself I am not making this all up).
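For readers who haven't met this corner of POSIX: kill() with a PID of -1 sends the signal to every process the caller has permission to signal. fork() returns -1 on failure, which is one plausible way for that value to reach kill(), though the story doesn't say exactly how it got there. Below is a minimal sketch of a guard. The wrapper name kill_one_process is hypothetical, and this is not the vendor's code, which the letter doesn't show.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>

/* Hypothetical wrapper: signal exactly one process, never a group.
 * kill() treats pid == -1 as "every process the caller may signal",
 * pid == 0 as "the caller's process group", and pid < -1 as a
 * process group ID, so all of those are refused here. */
int kill_one_process(pid_t pid, int sig)
{
    if (pid <= 0) {            /* also catches the -1 from a failed fork() */
        errno = EINVAL;
        return -1;
    }
    return kill(pid, sig);
}
```

With a guard like this, a stray -1 produces an error return the caller can log, rather than a box with one process left standing.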

Now, I should emphasize that this was not at all a bad or dysfunctional team. In fact, I'd say that the team that wrote this code consisted of some of the better engineers - very experienced and conscientious guys with quite decent processes. Of course they were pretty shocked to see the bug, which turned out to be something that had been there since the very early days of the system.

But the most educational aspect of this for me was the aftermath. We came clean, issued a patch with a high severity level, and congratulated and thanked our distributors for their sterling work.

But here's the thing that really got me thinking: this bug had been in the code for some years, and was never reported. Yet - after we issued the patch, reports of the behaviour started coming in regularly. Of course we just applied the patch, and all was well.

But did the bug start happening more frequently just because we identified it? Of course not. It had been happening all along, but probably never twice to the same person, and everyone simply rubbed their eyes, said "what ... !?" and reset the box. (Without the log scripts there was of course NO way to know that that one process had committed software genocide. It was a dead box that had suffered what looked for all the world like some bizarre hardware issue, and you couldn't even log in to see what had happened. Even if you had an open terminal, it was now dead.)

And that is why I NEVER trust software - no matter how good the guys are that wrote it. (If I want reliability I try to do the important parts of the job with relays and switches. I'm actually not joking.)

As an aside, I feel that the kind of skills that make really good engineers are almost the exact opposites of what are considered "good social skills". Without the unwillingness of our distributor to drop the issue, even when some of our senior guys thought they were frankly insane or incompetent, matched with willingness to invest a lot of time in investigating - we would NEVER have found that bug. I doubt we'd even have known it existed. (Luckily it was not a product on which human life depends.)

Thanks for reading this long missive. I feel that it is a tale that has a lot to teach us about the nature of reliability, and the human, rather than technical, reasons that it is so hard to achieve.

Testing is critically important, but it won't ensure the code is correct. It's just one of many filters we need to apply.

On Hardware in Asynchronous Sampling

In the last issue I wrote about some problems with using hardware to deal with an input wider than the CPU's bus. A number of people had thoughts about this.

Ian Stedman wrote that the async timer problem would not exist if the timers counted using Gray Code. With that, only one bit changes at a time. A 3-bit Gray sequence, for example, runs 000, 001, 011, 010, 110, 111, 101, 100.

I don't know of any timers that count in Gray Code, but Digi-Key lists 202 encoders that do, out of 7274 total encoders. Code to convert Gray to binary is here.
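For the curious, here is a minimal sketch of the conversion in both directions for the common reflected Gray code. It is not the linked code. The function names are my own, and the 32-bit width is an assumption.

```c
#include <stdint.h>

/* Binary to reflected Gray code: adjacent counts differ in exactly one bit. */
uint32_t binary_to_gray(uint32_t b)
{
    return b ^ (b >> 1);
}

/* Gray to binary: each binary bit is the XOR of that Gray bit and all
 * higher Gray bits. The shifts below build that running XOR in
 * five steps for a 32-bit word. */
uint32_t gray_to_binary(uint32_t g)
{
    g ^= g >> 16;
    g ^= g >> 8;
    g ^= g >> 4;
    g ^= g >> 2;
    g ^= g >> 1;
    return g;
}
```

An encoder read in the middle of a transition is then off by at most one count, since only one bit can be in flight.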

Craig Ross wrote:

Years ago, I implemented an asynchronous sampling routine that just did three reads. First read captured the high order word, second read captured the low order word, third read captured the high order word again. If the second read of the high order word matched the first read of the high order word, then there wasn't a roll-over while the low order word was read. If the two reads of the high order word were different, then a roll-over had occurred, so the sequence was repeated. At the time, it sounded like a good fix. Do you see problems with this approach, other than the number of steps?
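The three-read sequence Craig describes might look like the sketch below. The 16-bit register width, the function name and the pointer parameters are assumptions for illustration. On real hardware the pointers would be the addresses of the two halves of the counter.

```c
#include <stdint.h>

/* Read a free-running 32-bit count exposed as two 16-bit registers.
 * hi and lo are hypothetical register addresses. */
uint32_t read_count32(const volatile uint16_t *hi, const volatile uint16_t *lo)
{
    uint16_t high_before, low, high_after;

    do {
        high_before = *hi;     /* first read: high word        */
        low         = *lo;     /* second read: low word        */
        high_after  = *hi;     /* third read: high word again  */
    } while (high_before != high_after);   /* roll-over seen: repeat */

    return ((uint32_t)high_after << 16) | low;
}
```

The retry assumes the low word cannot wrap all the way around between the two high-word reads, so a very long interrupt landing in the middle of the sequence is the case to think about.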

David Wyland offered this about metastability:

I had a metastability problem early in my career, in 1971. It was in the interface clocking for a floating point processor APU. The FPU clock and the CPU clock were not the same, and were not synchronized. There was the dreaded error about once every 20 minutes to 1 hour. And no clue as to its source.

I found out about metastability much later, when I was working in applications at IDT. I read a TI paper about it, and I wrote a paper about metastability in systems using the dual port RAMs IDT made, where the clocks of the two ports were asynchronous. A little deeper study showed that when you had metastability, the settling time could be multiplied by up to 10X.

When you clock a signal into a FF while that signal is changing during the rising edge of the FF clock, the output takes ~10X the time to settle. At the time, the settling time of a 74S74 was 5 nanoseconds, so the metastability settling time would be ~50 ns.

The probability of settling time extension falls off rapidly and exponentially with increasing time. By the time you hit 10X, the probability of settling taking that long is vanishingly small. So 10X is pretty conservative, barely measurable in practical terms.

So how to cure the problem? Use a 2-stage shift register. Clock the signal you are sampling into the first flip-flop, and clock the output of the first FF into a second FF. The first FF output will settle to its value before being clocked into the second FF. And the second FF output will be clean, with a nominal settling time.

This works if the system clock period is at least 10X the settling time of the flip-flops in the system. And a minimum 10X ratio of clock period to FF settling time is a good design margin.

The output from the second FF will be delayed by 1 clock time, but this is seldom a problem in system design.

More on Increasing the Resolution of an ADC

Another subject in the last issue was using noise to increase the resolution of an ADC. Readers had lots of useful ideas.

Phil M had the intriguing suggestion of using a triangular wave:

4^n oversampling for n bits is correct for Gaussian/white noise, but if you can instead add a triangular voltage ramp to the signal (with a slope of 1 LSB/sample), then you improve this to 2^n, so you get one additional bit for every doubling of the oversampling rate. You can easily generate this ramp signal with a few smartly-chosen capacitors and resistors if the sample clock exists as a physical signal, or with an unused GPIO.
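As a reference point for the 4^n figure, the usual oversample-and-decimate step can be sketched as below. This shows the baseline arithmetic only, not Phil's ramp circuit. The function name, the 16-bit sample type and the cap on extra_bits are assumptions.

```c
#include <stdint.h>

/* Sum 4^extra_bits raw readings, then shift right by extra_bits.
 * The result carries extra_bits more bits of resolution than one
 * reading. samples must hold at least 4^extra_bits entries, and
 * extra_bits must be 8 or less so the 32-bit sum cannot overflow. */
uint32_t oversample_decimate(const uint16_t *samples, unsigned extra_bits)
{
    uint32_t count = (uint32_t)1 << (2 * extra_bits);   /* 4^n readings */
    uint32_t sum = 0;

    for (uint32_t i = 0; i < count; i++)
        sum += samples[i];

    return sum >> extra_bits;
}
```

For example, sixteen readings split evenly between 100 and 101 return 402 with extra_bits set to 2, which is 100.5 on the original scale. With the ramp Phil describes, the same n extra bits would need only 2^n readings.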

Jim Haflinger had an excellent idea: read, for instance, ten samples. Discard the highest and lowest, and average the remaining. The outliers might be, well, outliers.
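A minimal sketch of Jim's idea follows, assuming 16-bit readings already sitting in an array. The function name is made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Average n readings after discarding the single lowest and single
 * highest. Requires n >= 3. The 32-bit sum is safe for up to
 * 65536 16-bit readings. The result rounds down. */
uint16_t average_without_extremes(const uint16_t *samples, size_t n)
{
    uint32_t sum = 0;
    uint16_t lowest = samples[0];
    uint16_t highest = samples[0];

    for (size_t i = 0; i < n; i++) {
        sum += samples[i];
        if (samples[i] < lowest)  lowest = samples[i];
        if (samples[i] > highest) highest = samples[i];
    }

    sum -= (uint32_t)lowest + highest;   /* drop the two outliers */
    return (uint16_t)(sum / (n - 2));
}
```

One pass, no sorting, and only a few bytes of state, which suits a small micro.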

More on Whether 'Tis Nobler to Initialize or Not

Finally, another subject in the last Muse was about initializing variables. Is the BSS zeroed or not? My take is that I always explicitly initialize.

Rod Chapman, one of the smartest guys I know, is a SPARK advocate:

Chocolate teapot time again from the MISRA committee... this rule is marked "System and Undecidable" in their classification so it requires whole-program analysis, could be really slow, and you are still doomed to some combination of false positives and negatives from a static analysis tool.

Another approach: design the language so that data-flow analysis is Sound (0 false negatives, right?) and computed in P-Time. Sounds like a wacky idea??? Oh no... SPARK had this in 1987... :-)

Steve Karg wrote:

From section 3.5.7 of the C89 standard:

If an object that has static storage duration is not initialized explicitly, it is initialized implicitly as if every member that has arithmetic type were assigned 0 and every member that has pointer type were assigned a null pointer constant.

That inspired me to look more closely at the 500+ mind-numbing pages of the C99 standard, where I found in 6.7.8:

If an object that has automatic storage duration is not initialized explicitly, its value is indeterminate. If an object that has static storage duration is not initialized explicitly, then:
- if it has pointer type, it is initialized to a null pointer;
- if it has arithmetic type, it is initialized to (positive or unsigned) zero;
- if it is an aggregate, every member is initialized (recursively) according to these rules;
- if it is a union, the first named member is initialized (recursively) according to these rules.
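In code, those rules play out roughly as sketched below, assuming a conforming toolchain whose startup code really does zero the BSS. All of the names are invented for the example.

```c
#include <stddef.h>

struct settings {
    int gain;
    double offset;
    const char *name;
};

static int             tick_count;  /* no initializer: starts at 0        */
static double          scale;       /* starts at 0.0                      */
static int            *cursor;      /* starts as a null pointer           */
static struct settings defaults;    /* every member zeroed, recursively   */

/* Returns 1 if the statics above hold the values the standard promises. */
int statics_start_zeroed(void)
{
    return tick_count == 0 && scale == 0.0 && cursor == NULL &&
           defaults.gain == 0 && defaults.offset == 0.0 &&
           defaults.name == NULL;
}

/* Automatic variables get no such promise, so initialize explicitly. */
int sum_first(const int *values, size_t n)
{
    int total = 0;     /* without "= 0" the starting value is indeterminate */

    for (size_t i = 0; i < n; i++)
        total += values[i];
    return total;
}
```

A check like statics_start_zeroed(), called first thing in main(), is a cheap way to find out whether a particular toolchain keeps the promise.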

Arithmetic types get assigned to a "positive" zero. In the next letter from a reader, Tom Oke talks about a Cyber 172 machine, which used ones' complement math. That, for younger readers who never had to deal with this, supports both positive and negative zeroes.
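To make the two zeroes concrete, here is a small sketch of ones' complement arithmetic on a 12-bit word. The word size is chosen only for illustration, and the names are hypothetical.

```c
#include <stdint.h>

#define OC12_MASK 0x0FFFu   /* 12-bit word, an assumption for the example */

/* Negation in ones' complement inverts every bit, with no "add one" step. */
uint16_t oc12_negate(uint16_t x)
{
    return (uint16_t)(~x & OC12_MASK);
}

/* Both all-zeros (+0) and all-ones (-0) represent zero. */
int oc12_is_zero(uint16_t x)
{
    return x == 0 || x == OC12_MASK;
}
```

Negating +0 gives 0xFFF, a second bit pattern that still means zero, so a plain compare against 0 isn't enough on such a machine.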

Tom Oke had a problem from the olden days:

Many years ago (very early 1980's) I was a systems programmer for a University Academic Computing Services and we had a CDC Cyber 6400. This was a machine with 10 peripheral processors, each with 4K of 12-bit words, and a central processor running with 60-bit words.

It booted by reading a panel of 16 x 12 toggle switches (looked like they were good for about 5 amps each). So the boot program to PPU 0 could be up to 16 x 12-bit words which were read as the lower area of memory and then executed.

There was a table of the settings for all 16 rows of switches posted by the manufacturer on the panel beside the switches.

An upgrade to the system (to a Cyber 172) brought us a system that now sported additional banks of switches, with the same boot program, listed in the 16-row table.

As you can guess, one day the machine would not boot, and checking the switches against the table produced no answers, until it was noticed that one of the switches in the new upper rows was set.

The table never listed any updates but this was a variable that in the old system was read as 0, and used as an address initialisation to the boot program. The uninitialised variable (value not stated in the table) threw this off and it tried to boot from the wrong area.

So I guess you can get an uninitialised variable, even below the level of assembler (at the label printers).

Martin had a story where RC oscillators and uninitialized variables interacted:

In a rather small project I used a small 8-bit microcontroller. The compiler I used wasn't compliant with the C standard: its startup code didn't initialize the BSS. I knew that and thoroughly initialized all globals, but you know, I still forgot some.

Not really forgotten, but I was thinking "my code is self-initializing, since it is a counter that counts down to zero and stops at zero, so I don't need to initialize these". Yes, it did count down to zero and stop, but starting at a (not so) random value above the designed start value led to the well-known symptoms of some units working perfectly and others having subtle "effects". It took some time to sort this out. So, nothing new so far.

Somewhat later, with the same system, there was a similar symptom: many units worked just fine, but a few didn't start up well in the morning; later on they worked. Having been through the above procedure, my first thought was another forgotten variable initialization. Inspecting each global variable by looking it up in the map file and manually tracing its usage revealed nothing.

The system used the microcontroller's built-in watchdog for safety. The watchdog was initialized, tested (leading to a reset on purpose while starting up) and then triggered within the main program loop. Any unexpected occurrence of a watchdog or other reset event would cause the system to enter a safe state, requiring a power cycle to recover. Further debugging showed the non-working systems were entering said safe state, so they were working as designed but not as expected by the common user.

Now, one needs to know that the microcontroller uses independent internal RC oscillators for the CPU clock and the watchdog clock. And there were two of these microcontrollers, loosely synchronized to each other to form a redundant system. The synchronizing mechanism caused one micro to wait for the other at one point within the main loop.

So what happened: In the morning, the temperature in the office was lower, causing the frequencies of the internal RC oscillators to drift.

That in turn caused one microcontroller to wait a bit longer for the other while they synchronized. The watchdog interval had been chosen quite narrow, so that delay caused the watchdog to trigger, which in turn put the system into the safe state. Relaxing the watchdog interval solved the issue. So don't forget to account for all kinds of hardware-caused tolerances. RC oscillators especially can have rather large tolerances in comparison with resonators or crystals.
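A back-of-the-envelope version of that tolerance check can be sketched as below. The model is an assumption: a CPU oscillator running slow stretches the loop, while a watchdog oscillator running fast shortens the timeout. The function and parameter names are hypothetical, and the nominal loop time should already include the longest expected wait for the partner micro.

```c
#include <stdint.h>

/* Returns 1 if the slowest main loop still finishes before the earliest
 * watchdog timeout. Tolerances are in parts per thousand (100 means
 * +/-10%) and must be below 1000. */
int watchdog_has_margin(uint32_t loop_nominal_us, uint32_t cpu_tol_ppt,
                        uint32_t wdt_nominal_us, uint32_t wdt_tol_ppt)
{
    uint64_t slow_cpu = 1000u - cpu_tol_ppt;   /* CPU clock at its slowest */
    uint64_t fast_wdt = 1000u + wdt_tol_ppt;   /* watchdog at its fastest  */

    /* Longest loop time, rounded up to stay on the pessimistic side. */
    uint64_t loop_worst_us =
        ((uint64_t)loop_nominal_us * 1000u + slow_cpu - 1) / slow_cpu;

    /* Shortest watchdog timeout, rounded down for the same reason. */
    uint64_t wdt_worst_us = (uint64_t)wdt_nominal_us * 1000u / fast_wdt;

    return loop_worst_us < wdt_worst_us;
}
```

With a 9 ms loop, a 10 ms watchdog and +/-10% on both oscillators, the check fails: the loop can stretch to 10 ms while the watchdog can fire after about 9.1 ms. Widening the watchdog to 16 ms restores the margin.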

This Week's Cool Product

Not quite a product yet, but these passive sensors change how wi-fi signals from a smart phone are reflected. They can be used as switches and controls. This reminds me of one of the all-time amazing bugs: the Soviets gave the American ambassador a replica of the Great Seal of the United States, which hung in the ambassador's library. A passive resonant cavity vibrated in sync to conversations in the room; the Soviets flooded the room with microwaves, and the cavity modulated those, which could be picked up by a receiver. It took years for the Americans to find the bug.

Note: This section is about something I personally find cool, interesting or important and want to pass along to readers. It is not influenced by vendors.

Jobs!

Let me know if you’re hiring embedded engineers. No recruiters please, and I reserve the right to edit ads to fit the format and intent of this newsletter. Please keep it to 100 words. There is no charge for a job ad.

Joke For The Week

Note: These jokes are archived here.

Auto-correct has become my worst enema.

Advertise With Us

Advertise in The Embedded Muse! Over 28,000 embedded developers get this twice-monthly publication.

About The Embedded Muse

The Embedded Muse is Jack Ganssle's newsletter. Send complaints, comments, and contributions to me at jack@ganssle.com.

The Embedded Muse is supported by The Ganssle Group, whose mission is to help embedded folks get better products to market faster.