The Embedded Muse 275

The Embedded Muse
Issue Number 275, January 5, 2015
Copyright 2015 The Ganssle Group

Editor: Jack Ganssle, jack@ganssle.com

You may redistribute this newsletter for noncommercial purposes. For commercial use contact jack@ganssle.com.

Contents

Editor's Notes
Quotes and Thoughts
Tools and Tips
The Write-Only Memory's Back Story
Building Ultra-Low Power Systems
An Interesting Bug
Billions and Billions of Transistors
Jobs!
Joke for the Week
Advertise with us
About The Embedded Muse

Editor's Notes

Did you know it IS possible to create accurate schedules? Or that most projects consume 50% of the development time in debug and test, and that it’s not hard to slash that number drastically? Or that we know how to manage the quantitative relationship between complexity and bugs? Learn this and far more at my Better Firmware Faster class, presented at your facility. See https://www.ganssle.com/onsite.htm.

Quotes and Thoughts

While we all know that unmastered complexity is at the root of the misery, we do not know what degree of simplicity can be obtained, nor to what extent the intrinsic complexity of the whole design has to show up in the interfaces. We simply do not know yet the limits of disentanglement. We do not know yet whether intrinsic intricacy can be distinguished from accidental intricacy. - E. W. Dijkstra

Tools and Tips

Please submit clever ideas or thoughts about tools, techniques and resources you love or hate. Here are the tool reviews submitted in the past.

The Write-Only Memory's Back Story

Many are familiar with Signetics' fake Write-Only Memory device, invented back in 1972. A datasheet was produced which gave important specs, like cooling requirements (a six-foot fan one inch away), and graphs showed details like how many pins remained versus number of insertions into a socket.

I had a copy of the datasheet (since given away) and posted it on my site. It was a bit messy, and someone gave me PDFs of a cleaner copy. Turns out, those PDFs are counterfeit! I can't imagine why anyone would fake a fake datasheet, but John Curtis, the original author of the datasheet, contacted me recently and gave some of the back story.

My site once again hosts the original, correct, fake datasheet here.

Building Ultra-Low Power Systems

The Linley Group just released a report (Achieving Energy Efficiency With EFM32 Gecko Microcontrollers) which has an interesting description of the low-power features of Silicon Labs' EFM32 parts. But it contains the same sort of misinformation - or, at best, hopelessly optimistic claims - that permeates most discussions about this topic.

Some MCU vendors are making what I consider outrageous assertions about the ability of their MCUs to run from coin cells for years and even decades. They take a "typical" power consumption number (which in some cases is orders of magnitude better than worst-case), assume there's no other circuitry involved, and then do a poor analysis of real-world operating conditions.
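To see how much the choice of current figure matters, here is a back-of-the-envelope sketch in Python. The 220 mAh capacity is an assumed nominal figure for a CR2032-class cell, and the 2 uA and 20 uA average currents are made-up values chosen only to show the scaling, not measurements of any particular MCU. The calculation is deliberately idealized: it ignores self-discharge, the cell's internal resistance, and every other component on the board, all of which push the real answer down.

```python
HOURS_PER_YEAR = 24 * 365

def battery_life_years(capacity_mah, average_current_ua):
    """Ideal runtime in years: capacity divided by average current draw."""
    average_current_ma = average_current_ua / 1000.0
    return capacity_mah / average_current_ma / HOURS_PER_YEAR

# All three numbers below are illustrative assumptions, not measured data.
CAPACITY_MAH = 220.0        # assumed nominal capacity of a CR2032-class cell
TYPICAL_CURRENT_UA = 2.0    # hypothetical "typical" datasheet average current
WORST_CURRENT_UA = 20.0     # hypothetical figure ten times higher

typical_years = battery_life_years(CAPACITY_MAH, TYPICAL_CURRENT_UA)
worst_years = battery_life_years(CAPACITY_MAH, WORST_CURRENT_UA)

print(f"typical: {typical_years:.1f} years, worst: {worst_years:.1f} years")
```

With these assumed inputs the "typical" figure suggests more than a decade, while a current only ten times higher cuts the ideal life to a little over one year. A gap of several orders of magnitude, as mentioned above, shrinks it far more.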

I spent a good chunk of 2013 and 2014 studying the subject and ran numerous experiments to tease out the facts in building systems that have to run for years from a battery. The results were published in a number of articles on embedded.com. Most of what has been written by vendors about this topic turns out to be wrong or naive. There's essentially no way one can build systems that will run reliably for a decade or longer from a coin cell. Even picking the wrong decoupling capacitor will cause the system to fail in a year or two. Most MCU on-board brown-out reset circuits will greatly shorten available battery life. There's a whole host of quite interesting problems and solutions to using coin cells with microcontrollers.

The Linley report inspired me to digest those articles into a long article on the subject. Check out Hardware and Firmware Issues in Using Ultra-Low Power MCUs for the real scoop.

An Interesting Bug

It's fun and useful to learn from the experience of others. Bernard Nahas sent in this interesting bug:

During the break at your course, I mentioned a bug in the floating point context switching functionality in some operating systems running on PPC. Here is a slightly more detailed description. I tried to remember the details, but it's been over 4 years. It's hairy, so I suggest you get a cup of coffee before you start reading.

*** The symptoms ***
The system we were running used multiple processes running in different protected address spaces on a single-CPU PowerPC-based chip (MPC8360 e300). The problem rarely manifested itself. But occasionally after running for a few weeks, some floating point operations would give results that didn't make any sense. What was most puzzling is that this issue did not happen at the same place or in the same module, but at various places in the code.

*** Reproducing the problem ***
I somehow managed to reproduce the issue by running multiple instances of a test program simultaneously. The test program was simple: it ran in a while(1) loop, incremented a floating point value and then verified that the result incremented correctly. The only unusual thing was that it disabled interrupts during the increment and verify section, and re-enabled them afterwards:
while 1:
   a- Disable Interrupts
   b- Increment my floating point variable
   c- Check that my variable incremented
   d- Enable Interrupts
When I ran 1 instance of this program, it would run fine. If I ran 10 simultaneous instances of this program, within 15 minutes, one of the processes would detect an error in the floating point increment operation.

Disabling interrupts is usually bad practice for real time systems, but this was legacy code and it was a little risky to remove it all. Also, in your course, you talked about the correct way to re-enable interrupts, which is to restore the interrupt state that was in effect before the critical section rather than unconditionally turning interrupts on.
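The save-and-restore idea mentioned here can be sketched as follows. This is a minimal Python model, not code for any real OS or compiler: the Cpu class, the function names, and the 0x8000 mask for the interrupt enable bit are all assumptions made for illustration. The point is only that the exit path puts back whatever state it found, so nested critical sections behave.

```python
MSR_EE = 0x8000  # assumed mask for the external interrupt enable bit

class Cpu:
    """Toy CPU that holds nothing but a machine state register."""
    def __init__(self):
        self.msr = MSR_EE  # start with interrupts enabled

def enter_critical(cpu):
    """Disable interrupts and return whether they were enabled before."""
    was_enabled = bool(cpu.msr & MSR_EE)
    cpu.msr &= ~MSR_EE
    return was_enabled

def exit_critical(cpu, was_enabled):
    """Restore the earlier interrupt state instead of blindly enabling."""
    if was_enabled:
        cpu.msr |= MSR_EE

def nested_demo():
    """Nest two critical sections and report the EE bit at each stage."""
    cpu = Cpu()
    outer = enter_critical(cpu)
    inner = enter_critical(cpu)        # interrupts are already off here
    exit_critical(cpu, inner)          # must NOT turn interrupts back on
    still_off = not cpu.msr & MSR_EE
    exit_critical(cpu, outer)          # the outermost exit re-enables them
    back_on = bool(cpu.msr & MSR_EE)
    return still_off, back_on
```

In this model the read-modify-write inside enter_critical() cannot be interrupted. On the hardware described in this letter it can, which is where the trouble starts.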

But aside from these issues, my simple program should not result in any memory or register corruption.

*** Root cause ***
Digging deeper, it turned out that the floating point registers of one of the processes were not being restored by the OS after a context switch. This was triggered by an OS optimization in combination with a race condition:
- The race condition was in the enable/disable interrupt call (lines a and d above).
- The optimization was the use of a technique called lazy floating point restore. This technique does not restore the floating point registers immediately after a context switch, but waits until the first floating point instruction is executed.

*** Disable/Enable Interrupt Race Condition ***
The disable/enable interrupt call was not atomic on this HW architecture and required a read/modify/write to set or clear a single bit.

To disable interrupts, you had to clear a bit (EE - External Interrupt Enabled) in the MSR (machine state register).
To read and write the MSR, there were two instructions: mfmsr and mtmsr (Move From and Move To). So, disabling interrupts required three separate instructions:
1- mfmsr RegX # Move MSR to RegX
2- clear the interrupt enable bit in RegX
3- mtmsr RegX # Write the modified RegX back to MSR with Interrupts disabled
If the process gets pre-empted by the OS after instruction 1, but before instruction 3, it loses any changes that the OS made to its MSR in the meantime.
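The width of that race window can be shown with a small Python model of the three-instruction sequence. It is a sketch under stated assumptions: the masks 0x8000 (EE) and 0x2000 (FP) are believed to match classic 32-bit PowerPC but should be checked against the core's reference manual, and os_preempts() is a hypothetical stand-in for "the OS changed a bit in our MSR while we were switched out".

```python
MSR_EE = 0x8000  # assumed mask: external interrupt enable
MSR_FP = 0x2000  # assumed mask: floating point available

def os_preempts(msr):
    """Hypothetical stand-in for the OS altering our MSR (here: clearing FP)."""
    return msr & ~MSR_FP

def disable_interrupts(msr, preempt_before):
    """Run the three-step disable sequence on an MSR value.

    The OS runs just before step `preempt_before` (1, 2 or 3),
    or after the whole sequence if it is 4. Returns the final MSR.
    """
    if preempt_before == 1:
        msr = os_preempts(msr)
    reg_x = msr                      # 1- mfmsr RegX
    if preempt_before == 2:
        msr = os_preempts(msr)
    reg_x &= ~MSR_EE                 # 2- clear the interrupt enable bit
    if preempt_before == 3:
        msr = os_preempts(msr)
    msr = reg_x                      # 3- mtmsr RegX
    if preempt_before == 4:
        msr = os_preempts(msr)
    return msr

start = MSR_EE | MSR_FP
# True means the OS's change (FP cleared) was overwritten by step 3.
os_change_lost = {p: bool(disable_interrupts(start, p) & MSR_FP)
                  for p in (1, 2, 3, 4)}
print(os_change_lost)
```

Only a pre-emption that lands after the read and before the write-back loses the OS's update. That is a window of two instructions, which fits with the problem taking weeks to show up in the field.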

*** Lazy floating point restore ***
When switching processes, the OS saves all registers of the outgoing process in the process's context structure and loads the incoming process's registers. In many applications, the floating point registers are not used. Saving and restoring these floating point registers could be time-consuming. As an optimization, the lazy floating point mechanism allows the OS to switch processes without saving or restoring the FP registers immediately, and instead wait until right before the first floating point instruction is executed.

To implement this, the processor can generate an interrupt to the OS when a floating point instruction is executed. This interrupt is controlled by the FP (floating point available) bit of the MSR. If this bit is 0, a special FP interrupt is triggered when a floating point instruction is executed. If it is 1, the floating point instruction executes as usual and no interrupt is generated.

The idea is that on context switch the OS switches all registers except the floating point ones, and sets FP to 0. If the OS gets an FP interrupt, that means the new process needs its floating point registers restored. At that time, we save the existing floating point registers to the previous process's context struct and restore the active process's floating point registers. Then we set FP to 1 so that subsequent floating point operations execute normally.

This clearing of FP to 0 repeats every time a process is switched by the OS to delay swapping the floating point registers until they are needed by the active process.
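The mechanism described above can be modelled in a few lines of Python. This is a simplified sketch, not the OS's actual code: one float stands in for the whole FP register file, the class and function names are invented, 0x2000 is an assumed mask for the FP bit, and the fp_owner bookkeeping (skipping the swap when the registers already belong to the running process) is an assumption about how such a scheme is commonly arranged.

```python
MSR_FP = 0x2000  # assumed mask for the "floating point available" bit

class Process:
    def __init__(self, name, fp_value):
        self.name = name
        self.saved_fp = fp_value   # FP register image in the context struct

class Cpu:
    def __init__(self):
        self.msr = 0
        self.fp_reg = 0.0          # one value stands in for all FP registers
        self.fp_owner = None       # process whose values are live in fp_reg
        self.current = None
        self.fp_swaps = 0          # how many times FP state was actually swapped

def context_switch(cpu, incoming):
    """Switch processes without touching the FP registers; just clear FP."""
    cpu.current = incoming
    cpu.msr &= ~MSR_FP

def fp_unavailable_handler(cpu):
    """Runs on the first FP instruction after a switch."""
    if cpu.fp_owner is not cpu.current:
        if cpu.fp_owner is not None:
            cpu.fp_owner.saved_fp = cpu.fp_reg   # save previous owner's state
        cpu.fp_reg = cpu.current.saved_fp        # load the active process's state
        cpu.fp_owner = cpu.current
        cpu.fp_swaps += 1
    cpu.msr |= MSR_FP                            # later FP instructions run freely

def fp_increment(cpu):
    """Model one FP instruction: trap if FP is 0, then add 1.0."""
    if not cpu.msr & MSR_FP:
        fp_unavailable_handler(cpu)
    cpu.fp_reg += 1.0
    return cpu.fp_reg

cpu = Cpu()
a, b, c = Process("A", 10.0), Process("B", 20.0), Process("C", 0.0)
context_switch(cpu, a)
r1 = fp_increment(cpu)      # A's first FP instruction traps and loads A's state
context_switch(cpu, c)      # C never uses floating point: nothing is swapped
context_switch(cpu, a)
r2 = fp_increment(cpu)      # traps, but the registers are still A's
context_switch(cpu, b)
r3 = fp_increment(cpu)      # traps, saves A's state, loads B's
print(r1, r2, r3, cpu.fp_swaps)
```

Four context switches cost only two FP swaps here. The scheme depends entirely on FP really being 0 after every switch.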

*** The bug ***
Going back to my test program:
while 1:
   a- Disable Interrupts:
       i-   mfmsr Rx
       ii-  Clear bit EE in Rx
       iii- mtmsr Rx
   b- Increment my floating point variable
   c- Check that my variable incremented
   d- Enable Interrupts

If we get pre-empted between a-i and a-iii, we have already copied the MSR into a register for the purpose of clearing the interrupt enable bit. Unfortunately, that copy also contains the FP bit as it was at the time of the read.

The problem would arise when the FP bit is set to 1 and we get pre-empted. When the OS gives us back CPU focus, it clears the FP bit and does not immediately restore the floating point registers.

We resume execution at a-iii. This clobbers the MSR with the value it held before we got pre-empted. This means that the MSR:FP bit will be set to 1, even though the OS cleared it when giving us CPU focus. As a result, the OS will NOT get notification when we execute our first floating point instruction, and will not restore our floating point registers. We end up using floating point registers from the previously running process. Ouch.

A couple of non-ideal workarounds were:
- Always set FP to 0 after enabling or disabling interrupts
- Disable the lazy floating point context switch optimization

Technically, any modification of the MSR could cause problems during context switches, but in this case, it was the FP bit that got clobbered.
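The whole failure sequence, and the effect of the first workaround, can be replayed deterministically in a Python model. Everything here is an illustrative assumption: the class and function names, the masks 0x8000 (EE) and 0x2000 (FP), the single float standing in for the FP register file, and the reading of "set FP to 0" as clearing FP in the value written back to the MSR. It is a sketch of the logic described above, not the actual OS code.

```python
MSR_EE = 0x8000  # assumed mask: external interrupt enable
MSR_FP = 0x2000  # assumed mask: floating point available

class Process:
    def __init__(self, fp_value):
        self.saved_fp = fp_value    # FP image in the process context struct

class Cpu:
    def __init__(self):
        self.msr = MSR_EE
        self.fp_reg = 0.0           # stands in for the FP register file
        self.owner = None           # process whose values are live in fp_reg
        self.current = None

def switch_to(cpu, proc):
    """Lazy context switch: leave the FP registers alone, clear FP."""
    cpu.current = proc
    cpu.msr &= ~MSR_FP

def fp_increment(cpu):
    """One FP instruction; traps to the lazy-restore handler if FP is 0."""
    if not cpu.msr & MSR_FP:
        if cpu.owner is not cpu.current:
            if cpu.owner is not None:
                cpu.owner.saved_fp = cpu.fp_reg
            cpu.fp_reg = cpu.current.saved_fp
            cpu.owner = cpu.current
        cpu.msr |= MSR_FP
    cpu.fp_reg += 1.0
    return cpu.fp_reg

def run_scenario(clear_fp_workaround):
    """Replay the pre-emption between a-i and a-iii; return our next result."""
    cpu = Cpu()
    ours, other = Process(100.0), Process(500.0)
    switch_to(cpu, ours)
    fp_increment(cpu)               # we now own the FP regs (101.0), FP = 1
    rx = cpu.msr                    # a-i: mfmsr Rx, captures FP = 1
    switch_to(cpu, other)           # pre-empted before a-iii
    fp_increment(cpu)               # other process swaps in its own FP state
    switch_to(cpu, ours)            # OS clears FP before resuming us
    rx &= ~MSR_EE                   # a-ii: clear EE in the stale copy
    if clear_fp_workaround:
        rx &= ~MSR_FP               # workaround: never write FP = 1 back
    cpu.msr = rx                    # a-iii: mtmsr Rx
    return fp_increment(cpu)        # b: should yield 102.0

buggy = run_scenario(clear_fp_workaround=False)
fixed = run_scenario(clear_fp_workaround=True)
print(buggy, fixed)
```

In the buggy run the increment operates on the other process's value and returns 502.0 instead of 102.0. With FP forced to 0 in the written value, the trap fires and the correct registers are restored, at the cost of an extra FP interrupt after every interrupt disable or enable.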

Billions and Billions of Transistors

As Paul Simon put it, we live in an age of miracles and wonders, and certainly the electronics industry is the source of plenty of these. I bought a 128 GB flash drive recently. It cost $45. Amazing. Here's a picture of it on top of a core memory array. The core, which I bought from a surplus store in 1971, is 26 planes of 2K bits each for a total of roughly 50K bits. The top plane is visible in this picture and each of those little donuts holds just one bit of data. The flash drive on top holds one trillion bits.

Core memory and a flash drive
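For a sense of scale, the two capacities can be compared directly. The short Python sketch below assumes that "2K" means 2048 bits per plane and that the drive's 128 GB is decimal gigabytes; both are assumptions, and the ratio is only meant as a rough figure.

```python
CORE_PLANES = 26
BITS_PER_PLANE = 2 * 1024            # assuming "2K" means 2048 bits
FLASH_BYTES = 128 * 10**9            # assuming decimal gigabytes

core_bits = CORE_PLANES * BITS_PER_PLANE
flash_bits = FLASH_BYTES * 8
ratio = flash_bits // core_bits      # how many core arrays fit in the drive

print(f"core: {core_bits:,} bits, flash: {flash_bits:,} bits, ratio: {ratio:,}")
```

Under those assumptions the drive holds about 19 million times as much as the core array.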

How do they do this? It was impossible to resist the temptation to open the thing up, and these two pictures show the top and bottom of the PCB:

Top of flash drive

Bottom of flash drive

The two big chips, one on top and one on bottom, are SanDisk SDIN5C4-64G devices, which each hold 64 GB of data. The datasheet unsurprisingly notes these are MLC devices, which hold more than one bit per transistor, but doesn't say how many levels each transistor can handle. I suspect it's a 4-level device (two bits per transistor) since SanDisk announced that technology in 2009. That means each chip has over 100 billion transistors. Wow!
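The arithmetic behind that estimate can be sketched in Python. It rests on several assumptions: one storage transistor per cell, decimal gigabytes, and no allowance for spare blocks, error-correction bits, or the select and peripheral transistors a real die also contains. The 8-level case is included only to show that the conclusion does not depend on the guess about levels.

```python
import math

def storage_cells(capacity_bytes, levels_per_cell):
    """Cells needed to hold the data if each cell stores log2(levels) bits."""
    bits_per_cell = int(math.log2(levels_per_cell))
    return capacity_bytes * 8 // bits_per_cell

CHIP_BYTES = 64 * 10**9                           # assuming decimal gigabytes

cells_4_level = storage_cells(CHIP_BYTES, 4)      # 2 bits per cell
cells_8_level = storage_cells(CHIP_BYTES, 8)      # 3 bits per cell

print(f"4-level: {cells_4_level:,} cells, 8-level: {cells_8_level:,} cells")
```

At two bits per cell the data alone needs 256 billion cells per chip, and even at three bits per cell it is still about 170 billion, so "over 100 billion transistors" holds either way.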

The bottom picture also shows another chip, which is an NS1081 (from Norelsys) USB controller. The top board has a part labeled "AGHM L422" which I can't find any information about. But that's it for the electronics, other than some capacitors and a 25 MHz oscillator.

Jobs!

Let me know if you’re hiring embedded engineers. No recruiters please, and I reserve the right to edit ads to fit the format and intents of this newsletter. Please keep it to 100 words.

Joke For The Week

Note: These jokes are archived at www.ganssle.com/jokes.htm.

Martin Ostermeyer sent this: A friend of mine once did telephone troubleshooting for his computer-illiterate friend. Problem: the cursor was somehow 'broken'.

It took a while to find out that the mouse has to be used 'tail up', not 'tail down'.

Advertise With Us

Advertise in The Embedded Muse! Over 23,000 embedded developers get this twice-monthly publication.

About The Embedded Muse

The Embedded Muse is Jack Ganssle's newsletter. Send complaints, comments, and contributions to me at jack@ganssle.com.

The Embedded Muse is supported by The Ganssle Group, whose mission is to help embedded folks get better products to market faster.