Debugging ISRs - Part 2

This is part 2 of a two part series on debugging interrupt service routines. Part 1 is here.

Published in Embedded Systems Programming, June, 1996

By Jack Ganssle

Last month I ambitiously attacked the subject of debugging Interrupt Service Routines (ISRs). This infinitely deep subject is worthy of a book; maybe a Britannica-sized 20 volume set would be appropriate. Still, here's another stab at covering some of the more common problems.

Debugging INT/INTA Cycles

Before you can debug your ISR, the processor must accept the interrupt and properly vector to the handler. Most processors service an interrupt with the following steps:

The interrupt controller (if any) prioritizes multiple simultaneous requests, and issues a single interrupt to the processor
The CPU responds with an interrupt acknowledge cycle
The controller drops an interrupt vector on the databus
The CPU reads the vector, and computes the address of the user-stored vector in memory. It then fetches this value.
The CPU pushes the current context, disables interrupts, and jumps to the ISR

Interrupts from internal peripherals (those on the CPU itself) will generally not generate an external interrupt acknowledge cycle. The vectoring is handled internally and invisibly to the wary programmer, tools in hand, trying to discover his system's faults.

A generation of structured programming advocates has caused many of us to completely design the system and write all of the code before debugging. Though this is certainly a nice goal, it's a mistake for the low level drivers in embedded systems. I believe in an early wrestling match with the system's hardware. Connect an emulator, and exercise the I/O ports. They never behave quite how you expected. Bits might be inverted or transposed, or maybe there's a dozen complex configuration registers needing setup. Work with your system, understand its quirks, and develop notes about how to drive each I/O device. Use these notes to write your code.

Similarly, start prototyping your interrupt handlers with a hollow shell of an ISR. You've got to get a lot of things right just to get the ISR to start. Don't worry about what the handler should do until you have it at least being called properly.

Set a breakpoint on the ISR. If your shell ISR never gets called, and the system doesn't crash and burn, most likely the interrupt never makes it to the CPU. If you were clever enough to fill the vector table's unused entries with pointers to a null routine, watch for a breakpoint on that function. You may have misprogrammed the table entry or the interrupt controller, which would then supply a wrong vector to the CPU.
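A minimal C sketch of that idea follows. The table size, the names (null_isr, shell_isr, install_isr) and the array-of-function-pointers layout are all assumptions made for illustration; a real vector table has whatever format your CPU dictates.

```c
#include <stddef.h>

#define NUM_VECTORS 256               /* assumed table size; check your CPU */

typedef void (*isr_t)(void);

volatile unsigned stray_count;        /* unexpected interrupts seen so far */
volatile unsigned shell_count;        /* times the shell ISR was entered */

/* Null routine: every unused vector points here. Set a breakpoint on it. */
void null_isr(void)
{
    stray_count++;
}

/* Hollow shell of the real handler; it only proves that it was called. */
void shell_isr(void)
{
    shell_count++;
}

isr_t vector_table[NUM_VECTORS];

/* Point every entry at the null routine before installing real handlers. */
void init_vectors(void)
{
    for (size_t i = 0; i < NUM_VECTORS; i++) {
        vector_table[i] = null_isr;
    }
}

/* Install one handler; returns 0 on success, -1 on a bad argument. */
int install_isr(unsigned vector, isr_t handler)
{
    if (vector >= NUM_VECTORS || handler == NULL) {
        return -1;
    }
    vector_table[vector] = handler;
    return 0;
}
```

With the table built this way, a hit on null_isr tells you the interrupt arrived but vectored somewhere you did not expect.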

If the program vectors to the wrong address, then use a logic analyzer or emulator's trace to watch how the CPU services the interrupt. Trigger collection on the interrupt itself, or on any read from the vector table in RAM. You should see the interrupt controller drop a vector on the bus. Is it the right one? Maybe the interrupt controller is misprogrammed.

Within a few instructions (if interrupts are on) look for the read from the vector table. Does it read from the right table address? If not, and if the vector was correct, then you are either looking at the wrong system interrupt, or there's a timing problem in the interrupt acknowledge cycle. Break out the logic analyzer and check this carefully.

Hit the databooks and check the format of the table's entries. On an x86-style processor, four bytes represent the ISR's offset and segment address. If these are in the wrong order -- and they often are -- there's no chance your ISR will execute.
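As an illustration, here is a small C sketch of the real-mode x86 layout: the offset word comes first, then the segment word, each stored low byte first. The function names and the use of a byte array to stand in for the table at address zero are assumptions for the example.

```c
#include <stdint.h>

/* Write one real-mode x86 vector: offset word first, then segment word,
   each little-endian. `table` stands in for memory starting at address 0. */
void set_vector(uint8_t *table, unsigned vector,
                uint16_t segment, uint16_t offset)
{
    uint8_t *entry = table + 4u * vector;
    entry[0] = (uint8_t)(offset & 0xFFu);
    entry[1] = (uint8_t)(offset >> 8);
    entry[2] = (uint8_t)(segment & 0xFFu);
    entry[3] = (uint8_t)(segment >> 8);
}

/* Physical address the CPU will jump to: segment * 16 + offset,
   wrapped to 20 bits as on the 8086 and 186. */
uint32_t vector_target(const uint8_t *table, unsigned vector)
{
    const uint8_t *entry = table + 4u * vector;
    uint16_t offset  = (uint16_t)(entry[0] | (entry[1] << 8));
    uint16_t segment = (uint16_t)(entry[2] | (entry[3] << 8));
    return (((uint32_t)segment << 4) + offset) & 0xFFFFFu;
}
```

Reading the entry back with vector_target and comparing it to the ISR's address in the link map is a quick check that the words went in the right order.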

Frustratingly often the vector is fine; the interrupt just does not occur. Depending on the processor and peripheral mix, only a handful of things could be wrong:

Did you enable interrupts in the main routine? Without an EI instruction, no interrupt will ever occur. One way of detecting this is to sense the CPU's INTR input pin. If it's asserted all of the time, then generally the chip has all interrupts disabled.
Does your I/O device generate an interrupt? It's easy to check this with external peripherals.
Have you programmed the device to allow interrupt generation? Most CPUs with internal peripherals allow you to selectively disable each device's interrupt generation; quite often you can even disable parts of this (like, allow interrupts on "received data" but not on "data transmitted").

Modern peripherals are often incredibly complex. Motorola's TPU, for example, has an entire book dedicated to its use. You could teach an entire one semester college course about this part! Set one bit in one register to the wrong value, and it won't generate the interrupt you are looking for.

It's not uncommon to see an interrupt work perfectly once, and then never work again. The only general advice is to be sure your ISR re-enables interrupts before returning. Then look into the details of your processor and peripherals.

Some processors, like the Z80, have an external interrupt daisy chain that serves as a priority encoder. Look at these lines with a scope. If you see the daisy chain set to a zero, it's a sure indication that one device did not see the end-of-interrupt sequence. On the Z80 and Z180 processors this is provided by executing the RETI instruction. A simple RET, mixed with use of the daisy chain, will block off an interrupt after it happens once.

Intel's x86 family is often used with an 8259 interrupt controller. Some of the embedded CPUs in this family have 8259-like controllers built into the processor. If you forget to issue an EOI (end of interrupt) command to the 8259 when the ISR is complete, you'll get that one interrupt only.

You may need to service the peripherals as well before another interrupt comes along. Depending on the part, you may have to read registers in the peripheral to clear the interrupt condition. UARTs and Timers usually require this. Some have peculiar requirements for clearing the interrupt condition, so be sure to dig deeply into the databook.
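The two steps (clear the condition at the device, then release the controller) might look like the C sketch below. The 0x20 command value is the 8259's non-specific EOI; everything else is an assumption for illustration: the PC-style port address, the UART status bit, and the port_write stand-in that records the write so the sketch compiles on a host. On real hardware port_write would be an OUT instruction and the UART variables would be device registers.

```c
#include <stdint.h>

#define PIC_CMD_PORT  0x20u  /* assumption: PC-style 8259 address */
#define PIC_EOI       0x20u  /* 8259 OCW2: non-specific end of interrupt */
#define UART_RX_READY 0x01u  /* hypothetical "received data" status bit */

/* Host stand-ins for hardware registers (assumptions, not a real UART). */
volatile uint8_t uart_status = UART_RX_READY;
volatile uint8_t uart_data   = 'A';

uint16_t last_port;          /* last port written, for inspection */
uint8_t  last_value;         /* last value written, for inspection */

/* Stand-in for an OUT instruction; it records the write instead. */
void port_write(uint16_t port, uint8_t value)
{
    last_port  = port;
    last_value = value;
}

/* On many UARTs, reading the data register clears the receive interrupt. */
uint8_t uart_read(void)
{
    uint8_t c = uart_data;
    uart_status &= (uint8_t)~UART_RX_READY;
    return c;
}

volatile uint8_t received;

void uart_isr(void)
{
    received = uart_read();             /* 1: clear the condition at the device */
    port_write(PIC_CMD_PORT, PIC_EOI);  /* 2: release the interrupt controller */
}
```

Leave out either step and the symptom is the same: one interrupt, then silence.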

Debugging Speed Problems

If the ISR is not fast enough your system will fail. Unfortunately, few of the developers I talk to have any idea what "fast enough" means. Unless you generate the interrupt map I've discussed, only random luck will save you from speed problems.

When designing the system answer two questions: how fast is fast enough? How will you know if you've reached this goal? Some people are born lucky. Not me. I've learned that nature is perverse, and will get me if it can. Call it high tech paranoia. Plan for problems, and develop solutions for those problems before they occur. Assume each ISR will be too slow, and plan accordingly.

A performance analyzer will instantly show the minimum, maximum, and average execution time required by your code, including your ISRs. There's no better tool for finding real time speed issues.

Not everyone has an analyzer. You can instrument your code to make it "scopeable". Set a bit to a one when the ISR starts, and set it to zero when it completes. Connect a scope and measure how long the bit stays up. If the routine can run for varying lengths of time, use a digital scope set to accumulate sweeps, and watch for the longest iteration.

It's important to look at total interrupt overhead in a system as well. If your ISR runs in 100 microseconds, but gets invoked 10,000 times/second, there's serious trouble brewing. Watch how long the bit stays asserted over long periods of time - a second or more - and make sure it's not eating most of the CPU resources.
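The arithmetic is worth doing explicitly. A small C helper (the name isr_load_percent is made up for this example) multiplies execution time by invocation rate to get the share of each second spent in the handler:

```c
#include <stdint.h>

/* Percentage of CPU time consumed by one ISR.
   isr_us:  execution time of the handler, in microseconds
   rate_hz: how many times per second it is invoked */
uint32_t isr_load_percent(uint32_t isr_us, uint32_t rate_hz)
{
    uint64_t busy_us = (uint64_t)isr_us * rate_hz; /* microseconds busy per second */
    return (uint32_t)(busy_us / 10000u);           /* 1,000,000 us is 100 percent */
}
```

For the numbers above, 100 microseconds at 10,000 interrupts per second is 1,000,000 microseconds of ISR time every second: the handler alone wants the whole CPU.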

Set and reset this bit in all of the ISRs to see total interrupt overhead. It's sometimes frightening to see just how close to the wire some systems run!
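One possible way to code the scope bit in C is sketched here. The debug_port variable stands in for a spare memory-mapped output port and SCOPE_BIT for whichever pin you can reach with a probe; both are assumptions. The nesting counter is an addition so that the bit stays high while any handler is active when ISRs interrupt each other.

```c
#include <stdint.h>

#define SCOPE_BIT 0x01u        /* hypothetical: bit 0 of a spare output port */

volatile uint8_t debug_port;   /* stand-in for a memory-mapped output port */
volatile uint8_t isr_depth;    /* how many handlers are currently active */

/* Call first thing in every ISR, before interrupts are re-enabled. */
void scope_enter(void)
{
    if (isr_depth++ == 0) {
        debug_port |= SCOPE_BIT;            /* first handler in: raise the bit */
    }
}

/* Call last thing in every ISR, with interrupts disabled. */
void scope_exit(void)
{
    if (--isr_depth == 0) {
        debug_port &= (uint8_t)~SCOPE_BIT;  /* last handler out: drop the bit */
    }
}
```

The duty cycle of the pin, as seen on the scope, is then the total interrupt overhead.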

Too many developers fall into the serendipity school of debugging. They feel that if the system works and meets external specifications, it's ready to ship. Wrong. Hardware engineers stress their creations by running them over a temperature range. We should do the same, instrumenting our code or otherwise using performance-measuring tools, to be quite sure the system has sufficient margins designed in.

Debugging Missing Interrupts

A device that parses a stream of incoming characters will probably crash very apparently if the code misses an interrupt or two. One that counts interrupts from an encoder to measure position may only exhibit small precision errors, a tough thing to find and troubleshoot.

Having worked on a number of systems using encoders as position sensors, I've developed a few tricks over the years to find these missing pulses. It's never easy.

You can build a little circuit using a single up/down counter that counts up on every interrupt request and counts down on each interrupt acknowledge. If the counter always shows a value of zero or one, everything is fine.

Most engineering labs have counters - test equipment that just accumulates pulse counts. We have a scope that includes a counter. Use two of these, one on the interrupt pin and another on the interrupt acknowledge pin. The counts should always be the same.

You can build a counter by instrumenting the ISR to increment a variable each time it starts. Either show this value on a display, or probe the variable using your debugger.
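In C the software counter is only a line or two. In this sketch the names encoder_isr and missed_interrupts are invented for illustration, and the pulse count passed to missed_interrupts is assumed to come from an external counter on the interrupt pin.

```c
#include <stdint.h>

/* 16 bits wide so that a 16 bit CPU can update it in one instruction. */
volatile uint16_t encoder_irq_count;

/* Shell of the encoder ISR: count the entry, then do the real work. */
void encoder_isr(void)
{
    encoder_irq_count++;
    /* ... position bookkeeping would go here ... */
}

/* Pulses seen on the pin minus ISR entries; nonzero means lost interrupts.
   Both counts are assumed to have been started from zero together. */
int32_t missed_interrupts(uint32_t pulses_on_pin, uint32_t isr_entries)
{
    return (int32_t)(pulses_on_pin - isr_entries);
}
```

Probe encoder_irq_count with the debugger after a run and compare it with the external counter's reading.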

If you know the maximum interrupt rate, use a performance analyzer to measure the maximum time in the ISR. If this exceeds the shortest interval between interrupts, there's very likely a latent problem waiting to pounce.

Most of these sorts of difficulties stem from slow ISRs, or from code that leaves interrupts off for too long. Be wary of any code that executes a disable-interrupt instruction. There's rarely a good need for it; this is usually an indication of sloppy code.

It's rather difficult to find a chunk of code that leaves interrupts off. The ancient 8080 had a wonderful pin that showed interrupt state all of the time. It was easy to watch this on the scope and look for interrupts that came while they were disabled. Now, having advanced so far, we have no such easy troubleshooting aids. About the best one can do is watch the INTR pin. If it stays asserted for long periods of time, and if it's properly designed (i.e., stays asserted till INTA), then interrupts are certainly off.

Be sure to re-enable interrupts in your ISRs at the earliest safe spot.

Debugging Reentrancy Problems

Well designed interrupt handlers are largely reentrant. Reentrant functions, AKA "pure code", are often falsely thought to be any code that does not modify itself. Too many programmers feel if they simply avoid self-modifying code, then their routines are guaranteed to be reentrant, and thus interrupt-safe. Nothing could be further from the truth.

A function is reentrant if, while it is being executed, it can be re-invoked by itself, or by any other routine.

Suppose your main line routine and the ISRs are all coded in C. The compiler will certainly invoke runtime functions to support floating point math, I/O, string manipulations, etc. If the runtime package is only partially reentrant, then your ISRs may very well corrupt the execution of the main line code. This problem is common, but is virtually impossible to troubleshoot since symptoms result only occasionally and erratically. Can you imagine the difficulty of isolating a bug which manifests itself only occasionally, and with totally different characteristics each time?

Now, sometimes we're tempted to cheat and write a nearly-pure routine. If your ISR merely increments a global 32 bit value, say, to maintain time, it would seem legal to produce code that does nothing more than a quick and dirty increment. Beware! Especially when writing code on an 8 or 16 bit processor, remember that the C compiler will surely generate several instructions to do the deed. On a 186, the construct ++j might produce:

	mov	ax,[j]
	add	ax,1		; increment low part of j
	mov	[j],ax
	mov	ax,[j+2]	; high word is two bytes past the low word
	adc	ax,0		; prop carry to high part of j
	mov	[j+2],ax

An interrupt in the middle of this code will leave j just partially changed; if the ISR is reincarnated with j in transition, its value will surely be corrupt.
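To see how bad the in-between state can be, here is a host-side C simulation of the two halves of that increment. It is only a model: the array j_words, the carry variable and the function names are made up for illustration, and each function stands for one half of the instruction sequence.

```c
#include <stdint.h>

uint16_t j_words[2] = { 0xFFFFu, 0x0000u }; /* j = 0x0000FFFF, low word first */
unsigned carry;                             /* models the CPU's carry flag */

/* The 32 bit value anyone else would see if they read j right now. */
uint32_t j_value(void)
{
    return ((uint32_t)j_words[1] << 16) | j_words[0];
}

/* First half of the increment: add 1 to the low word and note the carry. */
void increment_low_word(void)
{
    carry = (j_words[0] == 0xFFFFu);
    j_words[0] = (uint16_t)(j_words[0] + 1u);
}

/* Second half: propagate the carry into the high word. */
void increment_high_word(void)
{
    j_words[1] = (uint16_t)(j_words[1] + carry);
    carry = 0;
}
```

Starting from 0x0000FFFF, an interrupt that lands between the two halves sees j as zero instead of 0x00010000, an error of 65,536 counts.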

Watch out for noise on the NMI line. NMI is usually an edge-triggered signal. Any bit of noise or glitching will cause perhaps hundreds of interrupts. Since it cannot be masked, you'll almost certainly cause a reentrancy problem. This is yet another reason to avoid NMI for anything other than a catastrophic failure.

Even the perfectly coded reentrant ISR leads to problems. If such a routine runs so slowly that interrupts keep giving birth to additional copies of it, eventually the stack will fill. Once the stack bangs into your variables the program is on its way to oblivion. You must ensure that the average interrupt rate is such that the routine will return more often than it is invoked.

Debugging Stack Problems

Any of a number of problems can cause the stack to grow to the point where the entire system crashes. It's tough to go back and analyze the failure after the crash, as the program will often write all over itself or the variables, removing all clues.

The best defense is a strong offense. Build a stack monitor into your code.

A stack monitor is just a few lines of assembly language that compares the stack pointer to some limit you've set. Estimate the total stack use, and then double or triple the size. Use this as the limit.

Put the stack monitor into one or more frequently called ISRs. Jump to a null routine, where a breakpoint is set, when the stack grows too much.

Be sure that the compare is "fuzzy". The stack pointer will never exactly match the limit.
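The logic of the monitor, sketched in C rather than assembly, is shown below. Reading the stack pointer is not portable C, so this version takes it as a parameter; on a real target a line of assembly would supply it. It also assumes a stack that grows toward lower addresses, as on the x86. The names stack_overgrown, stack_trap and stack_monitor are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

volatile unsigned stack_trap_hits;   /* times the limit has been crossed */

/* Null routine: set a breakpoint here to catch the stack before the crash. */
void stack_trap(void)
{
    stack_trap_hits++;
}

/* "Fuzzy" compare: true when the stack pointer is at or past the limit,
   not only when it equals the limit exactly.
   Assumes the stack grows downward. */
bool stack_overgrown(uintptr_t stack_pointer, uintptr_t limit)
{
    return stack_pointer <= limit;
}

/* Call from one or more frequently invoked ISRs. */
void stack_monitor(uintptr_t stack_pointer, uintptr_t limit)
{
    if (stack_overgrown(stack_pointer, limit)) {
        stack_trap();
    }
}
```

On a processor whose stack grows upward, the comparison would be reversed.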

By catching the problem before a complete crash, you can analyze the stack's contents to see what led up to the problem. You may see an ISR being interrupted constantly (that is, a lot of the stack's addresses belong to the ISR). This is a sure indication of code that's too slow to keep up with the interrupt rate. You can't simply leave interrupts disabled longer as the system will start missing them. Optimize the algorithm and the code in that ISR.

Conclusion

I've made a number of recommendations, most of which fall into a philosophy of debugging: plan for bugs, instrument your code to find them, and buy the right tools.

Someday we'll all write bug-free code. Till then, debug proactively. Anticipate the problems, and design in test code and solutions from the outset.