Volume 2, Number 6 Copyright 1997 TGG September 4, 1997
You may redistribute this newsletter for noncommercial purposes. For commercial use contact firstname.lastname@example.org.
EDITOR: Jack Ganssle, email@example.com
- Editor’s Notes
- More Dumb Mistakes
- Embedded Seminar in Boston
- Thought for the Week
- About The Embedded Muse
For those of you interested in the seminar “The 24 Best Ideas for Developing Better Firmware Faster” in Boston on September 18, see the info later in this newsletter. It’s just about booked up.
Thanks to all of the readers who have submitted ideas for the “Dumb Mistakes” series. It’s quite interesting that almost all of the submissions to date are hardware related. Is software immune from these sorts of errors? Or, are the software mistakes soooooo dumb that we’re afraid to admit them? This issue consists of more contributions from readers all over the world, from New Zealand to Croatia - thanks, all.
More Dumb Mistakes
From: Jonathan Marks in New Zealand
One dumb mistake that I made when cutting my teeth on embedded work and have seen very often since is: Placing the code that kicks the watchdog in the timer tick ISR, and then wondering why the program never resets when it goes off to never never land.
How many of us out there are willing to admit we made this sort of mistake?
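One common way around this trap, sketched here in C under some assumptions, is to let the ISR only set a flag and to refresh the watchdog from the main loop, and only when both the tick and the main loop have shown signs of life. All names below (timer_tick_isr, main_loop_pass_done, watchdog_service, wdt_hw_kick) are hypothetical, and wdt_hw_kick merely counts refreshes where real code would write to the watchdog hardware.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names throughout. On real hardware, wdt_hw_kick() would
 * write the watchdog refresh register; here it only counts refreshes. */
static volatile bool tick_seen;   /* set by the timer tick ISR        */
static bool main_loop_alive;      /* set at the end of each loop pass */
static uint32_t kick_count;       /* stand-in for hardware refreshes  */

static void wdt_hw_kick(void)
{
    kick_count++;
}

/* The tick ISR records that it ran, but does NOT kick the watchdog. */
void timer_tick_isr(void)
{
    tick_seen = true;
}

/* Called once the main loop has completed a full pass of its work. */
void main_loop_pass_done(void)
{
    main_loop_alive = true;
}

/* Called from the main loop. Kicks only if both the ISR and the main
 * loop have reported in since the last kick. Returns true if it kicked. */
bool watchdog_service(void)
{
    if (tick_seen && main_loop_alive) {
        tick_seen = false;
        main_loop_alive = false;
        wdt_hw_kick();
        return true;
    }
    return false;
}
```

With this arrangement, a main loop that has wandered off to never never land stops reporting in, the kicks stop, and the watchdog gets its chance to reset the system even though the timer tick is still running.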
From: Georg Christen in Germany
Although I'm working in technical management, it is sometimes helpful to not have forgotten everything one has learned at university some time ago.
We once developed an embedded DSP system that used serial communication lines to talk with the outside world. The serial interface chips were very simple, maybe 12 pins or so, and there was a pretty good application example in the data sheet of those devices. Everything worked fine; however, from time to time (absolutely NOT deterministically, of course), the serial communication froze. The hardware designer blamed the software, the software designer couldn't find a fault in his software, and so the blame went back and forth: "Maybe the software does this or that" (hw guy) or "Probably the I/O chip just freezes" (sw guy). When we became aware of the error, we actually had one of those boards out at a customer (astonishingly, he didn't find the error!), and time was running short to identify and fix the problem!
So, I took the two designers, and we went for a long, long debugging session. Nothing. Finally I went through the I/O software code and didn't see anything major. Then we went through the schematics and checked the wiring (it was a wire-wrap board). Nothing. Night had arrived and I ordered pizza for all of us. While waiting for the pizza man to arrive, I browsed through the data sheet of the infamous I/O chip. Two pins were labeled "CTG". OK, so where do those pins go? Ah, they are not connected. "Tell me, why aren't they connected?" - "The data sheet doesn't say anything about those pins, and the application example doesn't use them either, so I left them open". Aha. There was a tiny footnote on the data sheet, which said "CTG: Connect to Ground. If not connected to ground, the device may fail intermittently or develop an undeterministic behavior". So as soon as we connected those pins to ground, the system worked like a charm. When the pizza arrived, everything was OK and everybody was happy (not least about the pizza).
The moral? Read everything (yes: everything!) in a data sheet. Connect all pins of a device. Yes, all! Ordering pizza can be a good idea! (EDITOR’S NOTE: I like that idea!) And, talk with people, even if they are only marginally involved with the project. Sometimes, a designer (hw or sw) can be a bit "blind", and heavily biased.
We got the customer board exchanged without him noticing it, but that's another story.
From: T.N.Kishore Reddy in The Netherlands
Here is one such mistake of the numerous ones that I made, but by far the most costly one...!!
I was working in the Test & Measuring Instruments Division of a large organisation, on industrial trivector meters. In this particular product, apart from logging the consumption of power, keeping track of failures and of tampering with the meter is a big issue. Our product used Motorola's timekeeper IC, which had to be kept running during power failures with the help of a lithium battery backup. When the silicon industry boom produced an NVRAM with battery backup and a timekeeper built in, we decided to go for this IC from SGS-Thomson, made minor modifications to the board, and plugged in this IC in place of Motorola's. We decided that there was no need to test the s/w, as the new IC was, we believed, fully compatible with the one it replaced.
We sent the meters to the customer, and from the very next day after installation we started getting reports of erroneous power failure indications by the meter. We brought the meter back and analysed the problem, and much to my embarrassment, there is a line in the data sheet of this IC stating that the time cannot be read from the real-time clock for one second after power on. As there was no such condition in the earlier chip, the s/w was written in such a way that the time was read immediately on powering on the system. So when we read the time at power on, it had not yet been updated and was the same as the time at which the power failed...!! So the power fail time and power on time were the same, and the power fail duration became 'zero'…
This mistake has cost my company more than money; the customer was not very happy with the product on day one…
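A guard of the following kind might have caught this. It is only a sketch in C: the function names, the millisecond power-on counter and the seconds-based timestamps are all assumptions made for illustration, and the one-second settle time is the figure from the story, which should be checked against the actual data sheet.

```c
#include <stdbool.h>
#include <stdint.h>

/* Settle time taken from the story above; verify against the data sheet. */
#define RTC_SETTLE_MS 1000u

/* True once the RTC may be trusted. ms_since_power_on is assumed to come
 * from a free-running millisecond counter started at power on. */
bool rtc_time_is_valid(uint32_t ms_since_power_on)
{
    return ms_since_power_on >= RTC_SETTLE_MS;
}

/* Computes the outage duration in seconds. Returns false, and leaves
 * *duration_s untouched, if the RTC has not yet had time to update. */
bool outage_seconds(uint32_t ms_since_power_on,
                    uint32_t power_fail_time_s,
                    uint32_t rtc_now_s,
                    uint32_t *duration_s)
{
    if (!rtc_time_is_valid(ms_since_power_on)) {
        return false;   /* too early: the RTC still holds the stale time */
    }
    *duration_s = rtc_now_s - power_fail_time_s;
    return true;
}
```

The caller simply retries until the function returns true, so a reading taken too early can never be logged as a zero-length power failure.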
From: Tim Loose in the USA
I wanted to share with you one of my "learning experiences" from my first engineering job.
I was working for a telecommunications company. My first independent assignment involved a panel with 24 LED's to display the status of 24 telephone lines. I devised a clever scheme that reduced the number of driving transistors; unfortunately, the prototype kept blowing out the LED's. I eventually determined that the LED's could not withstand the +48 VDC in a reverse biased condition; this is not a spec prominently displayed on the spec. sheet, so I assumed it would work. The first big "learning experience".
The trick is to minimize the unnecessary "learning experiences" and keep learning from the unavoidable ones.
From: Charles Manning in New Zealand
This didn't happen to me, and it's a recollection from about seven years back so, like a fine wine, it has probably improved with age but the core story is true. Names have been changed not to protect the guilty but because I've forgotten...
Chuck and Dave were good friends and had worked with each other on many projects. They were working on the control and motor drive circuit for an anti-aircraft gunnery turret. This is a formidable piece of machinery: eighteen tons of steel (it had to be heavy to handle the guns firing multiple rounds per second) that could be rotated as slowly as a few milliradians per second (way less than 1 degree per second) up to three radians per second (about half a revolution per second).
There had been a report of "strange sounds at low speeds"; one of those multi-variable problems that could be software, electronic or mechanical. "Just needs some oil", said Dave.
Chuck and Dave hooked up a bunch of test gear with the idea of rotating the machine at its slowest speed and monitoring what happened. Dave got into the turret and took hold of the diagnostic keyboard. The keyboard had a key marked "+R" which would speed up rotation towards the right. From stopped, press once for the slowest speed, press more for higher speeds up to ten for the highest speed. Dave tapped the button. Nothing. Tap.... tap, tap, tap. Still nothing. "Hey Chuck [tap, tap] this button's busted [tap, tap] again", tap, tap, tap, tap, tap. "Dave you moron [they weren't into this kumbaya stuff], did you clear the breakpoint?". Dave clears the breakpoint and the command processor task immediately processes all the key strokes. Eighteen tons of steel lurches forth at full speed.
Before somebody got to the circuit breaker, tens of thousands of dollars of test gear was smashed. Dave got thrown on his back and needed stitches above his eye where he got hit by a logic analyzer. Chuck needed a while in hospital - the gun barrel had hit him on the back like an oversized baseball bat.
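The story doesn't say how the command processor was fixed, if it ever was. One possible defence, sketched below in C purely as an illustration, is to timestamp each key press as it is queued and to throw away anything that has gone stale by the time the command task gets around to it. The structure, the function names and the 200 ms age limit are all hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define CMD_MAX_AGE_MS 200u   /* hypothetical limit; choose per system */

typedef struct {
    char     key;            /* e.g. 'R' for the "+R" key        */
    uint32_t queued_at_ms;   /* time at which the key was queued */
} command_t;

/* True if the command is recent enough to act on. The unsigned
 * subtraction gives the right age even if the counter has wrapped. */
bool command_is_fresh(const command_t *cmd, uint32_t now_ms)
{
    return (uint32_t)(now_ms - cmd->queued_at_ms) <= CMD_MAX_AGE_MS;
}

/* Counts how many queued commands would still be executed at now_ms. */
unsigned count_fresh(const command_t *queue, unsigned length, uint32_t now_ms)
{
    unsigned fresh = 0;
    for (unsigned i = 0; i < length; i++) {
        if (command_is_fresh(&queue[i], now_ms)) {
            fresh++;
        }
    }
    return fresh;
}
```

With a check like this, ten key presses that piled up behind a breakpoint would mostly be discarded when the task resumed, rather than being replayed all at once.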
Another from: Charles Manning in New Zealand
When working on a double Eurocard rack system, we typically used an extender card to get access to the card being debugged. Occasionally this wasn't enough and we made up two Eurocard sockets with power etc so that we could pull the card out of the system and debug it on the bench. This went great until the day we switched the two sockets and fed -12V and +12V into the 0 and 5V connections.........
On the same project, we started using a new PCB house that had just geared up to multi-layer boards. When the boards came back, they didn't work. Buzzing it out, we found that power was not getting onto the board and we had DIP ICs with their legs all shorted together. Funny, we could see through 6 layer boards with ground and power planes! We figured out what had happened: the PCB house had never seen ground/power planes before and figured that with all this copper, these must be negative layers, so they kindly did us the service of reversing them.
From: Dejan Durdenic in Croatia
Here's one about "never assume anything":
Some time ago I developed a PCB that included a 10-bit DAC from PMI. The company was bought by Analog Devices, who included all former PMI parts in the AD catalogue. After the PCB was assembled, the test software behaved oddly - instead of a sawtooth waveform, the DAC produced a mess... I checked and checked and finally discovered that, for some reason, PMI had marked the DAC's MSB as D1 and its LSB as D10 (most other parts mark those as D9 and D0...). The problem was detected, but the PCB was done... so I created a bit-swap look-up table (faster than a subroutine) and fixed the hardware bug in software. Moral: ALWAYS check data sheets thoroughly!
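A bit-swap look-up table of the kind described might look something like the C sketch below. It assumes the wiring error amounts to a complete reversal of the ten data bits, which is what the D1/D10 labelling suggests but which the story does not spell out. The names (dac_swap_lut, dac_swap_lut_init, dac_code_for_board) are made up for the example.

```c
#include <stdint.h>

#define DAC_BITS  10u
#define DAC_CODES (1u << DAC_BITS)   /* 1024 possible codes */

static uint16_t dac_swap_lut[DAC_CODES];

/* Slow, bit-by-bit reversal of a 10-bit value; used only to fill the table. */
static uint16_t reverse_bits10(uint16_t value)
{
    uint16_t reversed = 0;
    for (unsigned bit = 0; bit < DAC_BITS; bit++) {
        if (value & (1u << bit)) {
            reversed |= (uint16_t)(1u << (DAC_BITS - 1u - bit));
        }
    }
    return reversed;
}

/* Build the table once at start-up. */
void dac_swap_lut_init(void)
{
    for (unsigned code = 0; code < DAC_CODES; code++) {
        dac_swap_lut[code] = reverse_bits10((uint16_t)code);
    }
}

/* Fast path: a single table look-up per sample written to the DAC. */
uint16_t dac_code_for_board(uint16_t code)
{
    return dac_swap_lut[code & (DAC_CODES - 1u)];
}
```

The table costs 2 KB of memory with 16-bit entries, in exchange for replacing a ten-iteration loop with one indexed read on every sample.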
The other one is more subtle...
I developed a PCB around the TMS320C31 DSP. I had some RAM and some glue logic around it. The logic was done with ALTERA EPLDs (very fast ones - that's important). After the prototype was assembled and test software written, some strange things started to occur. It seemed that, for some reason, the program counter "rolled around" and the program started again. I used a 100MHz scope to check whether there were any unwanted reset pulses - there were none. Then I used a JTAG debugger to check register values, but registers which should have been cleared by a hardware reset kept their values - I concluded that the CPU had fetched some incorrect data from RAM and jumped back to the start, or something like that. I spent a whole week searching, but everything failed. To make matters worse, the situation was not exactly repeatable. Sometimes it happened immediately, sometimes the program ran for several seconds. I was desperate and went back to the good old scope. Then - there it was! I connected the probe to the RESET line and set the trigger to falling edge (it was a TDS320 TEK scope - a digital one with a very good trigger circuit), and sometimes the trigger responded even though there seemed to be no change on the line!!! The scope's bandwidth was too low for the glitch to be seen, but the trigger captured it. The glitch was so short that it only partially reset the CPU (only the PC was cleared)! The RESET pulse was created in an EPLD, and as the EPLD was extremely fast (7 ns), the glitch was fast too... Then I recalled an old piece of advice - never use a logic family faster than your application needs!
(By the way - the problem was solved with a small SMT cap on the RESET line on the prototype, and with better layout around the EPLD in the final version)
From: David Hinerman in the USA
Here's one that my dad pulled.
Dad and I were taking a night class in printed circuit board design at the local technical college. Me because I was studying to go into electronics, Dad because it was fun. The 'final' for the course was to design, fabricate, and assemble a PC board for some small circuit. I built a crystal filter, but Dad wanted a hefty 12 VDC power supply to power his CB radio at home.
Dad did a really nice job on the board -- 2 oz. copper, 1/4 inch wide tracks for the high-current circuits, and about 25,000 microfarads worth of filter caps. He bought a really nice case from Radio Shack and mounted the board inside, with the transformer, and made a really slick front panel to cover it all.
He brought it in to class to show it off a few weeks before the course ended. We all stood around listening to the local chatter on Channel 19 when suddenly BAM! As we crawled out from under various workbenches we saw paper protruding from the vents in Dad's power supply cabinet. Dad opened it up, and the teacher pulled out the remains of the filter cap casing. The ratings were clearly readable: 25,000 uF, 16 WVDC. Dad had assumed that a 16 volt cap was plenty for a 12 volt supply.
The teacher explained to Dad about RMS and peak voltages, and capacitor charging tendencies, and all that. Dad had used an 18 volt transformer, so the teacher recommended a 35 volt cap. Dad went out the next day, bought one, and installed it that evening.
The next class, Dad brought the box in to try again. We stood around listening to his CB and making jokes about mortar fire when BAM! More confetti.
Dad had installed the new cap backwards.
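The teacher's point about RMS and peak voltages can be put into numbers. The C sketch below assumes an ideal sine wave and ignores rectifier drops and the tendency of a lightly loaded transformer to run above its rated voltage. The function names and the 25% safety margin in the usage note are illustrative choices, not anything from the story.

```c
#include <stdbool.h>

#define SQRT2 1.41421356237   /* ratio of peak to RMS for a sine wave */

/* Peak voltage the filter cap charges toward, for an ideal sine wave. */
double rectified_peak_volts(double secondary_vrms)
{
    return secondary_vrms * SQRT2;
}

/* True if the cap's working voltage covers the peak times a margin
 * (for example, margin = 1.25 asks for 25% headroom). */
bool cap_rating_ok(double secondary_vrms, double cap_wvdc, double margin)
{
    return cap_wvdc >= rectified_peak_volts(secondary_vrms) * margin;
}
```

For the 18 volt transformer, the peak works out to roughly 25.5 volts, so the 16 WVDC cap never stood a chance, while the recommended 35 volt part passes even with the 25% margin. None of which helps if the cap goes in backwards.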
From: Michael J. Schreck in the USA
Here's one for the assumptions list. I was wire wrapping a small prototype circuit in stages and testing each stage when it was finished. After wiring in the 4th stage, the previous stages stopped working. I spent 3 days with a logic probe and a scope, only to find out that I had severed the power connection to the first stage! I no longer assume that anything is getting power!
Embedded Seminar! Boston - Thursday, September 18
Last issue I mentioned a full-day embedded seminar I’ll be conducting in Boston next month. It’s called “The 24 Best Ideas for Developing Better Firmware Faster”, and is for the developer who is honestly looking for new ideas, but who wants to cut through the academic fluff of formal methodologies and immediately find better ways to work.
The focus is uniquely on embedded systems. I’ll talk about ways to link the hardware and software, to identify and stamp out bugs, to manage risk, and to meet impossible deadlines.
A few seats are still available. For more information check out http://www.ganssle.com or email firstname.lastname@example.org.
Thought for the Week
TOP TEN THINGS ENGINEERING SCHOOL DIDN'T TEACH YOU
10. There are at least 10 types of capacitors.
9. Theory tells you how a circuit works, not why it does not work.
8. Not everything works according to the specs in the
7. Anything practical you learn will be obsolete before you use it, except the complex math, which you will never use.
6. Always try to fix the hardware with software.
5. Engineering is like having an 8 a.m. class and a late afternoon lab every day for the rest of your life.
4. Overtime pay? What overtime pay?
3. Managers, not engineers, rule the world.
2. If you like junk food, caffeine and all-nighters, go into
1. Dilbert is not a comic strip, it's a documentary.