Pipelines and Prefetchers
All modern processors use pipelines and/or prefetchers to increase performance. Here's how they work.
Published in Embedded Systems Programming, June 1991
For novel ideas about building embedded systems (both hardware and firmware), join the 35,000 engineers who subscribe to The Embedded Muse, a free biweekly newsletter. The Muse has no hype and no vendor PR. Click here to subscribe.
By Jack Ganssle
There's more than "crude" in them thar pipelines. While Congress fiddles with trying to keep oil pipelines full, those of us building the information economy realize that bandwidth is a far more valuable commodity, though one that will probably never generate the same sort of frenzied trading in the pits of Chicago.
High performance systems are ultimately bound by memory speed. The economics are compelling: it's pretty easy (where speed is an issue) to justify one relatively expensive super-fast CPU. Equally fast memory is generally out of the question, since hundreds or even thousands of chips make up the storage array.
Usually memory is the high performance system's biggest bottleneck. Every single instruction must be fetched from the memory every time it is needed. When the CPU is faster than the RAM chips, the connection between them is the limiting performance parameter.
RF engineers use the word "bandwidth" to describe the information capacity of a radio channel. The stink over HDTV is essentially a bandwidth issue: how can one cram additional picture information onto the current 6 Mhz TV channel allocation? Bandwidth is just as big an issue in computer systems, where we transfer millions of bytes/second over a data bus.
Computer history is filled with innovative and sometimes desperate attempts to maximize the bandwidth of the CPU-to-memory connection. Back in the 60's megabuck machines relied on interleaved core. All even addresses went to one physical memory bank, and the odds went to another. The operation of each bank could run in parallel, effectively doubling the bus rate.
Now PCs and workstations rely on huge caches, which are nothing more than RAMs stuck between the main memory and the CPU. The cache RAM runs at the same speed the processor does, pushing the bandwidth limitations further away from the computer. If the next instruction needed is in cache, the CPU can fetch it without waiting. As soon as it needs one outside of the cached region the entire computer effectively comes to a halt as it waits for the relatively slow memory to cycle.
Conventional (i.e., CISC) processors make very strange demands on the memory array. When executing short instructions like register to register ADDs, the CPU is usually much quicker than memory and sucks instructions as fast as the RAM can supply them. Frequently the CPU will slow down as it handles a slow instruction. It might take a number of machine cycles to handle a complex addressing mode, or perhaps hundreds to do a multiply. During these periods the bus is idle.
I believe the 8088 was the first widely available microprocessor to address this issue. Intel divided the CPU in two. An Execution Unit contained all of the traditional processor-like features. The Bus Interface Unit (BIU) sat between the Execution Unit and the device's pins. If the 8088 were a simple 8080 or 6800, the BIU would just pass memory requests from the Execution Unit to the outside world. However, Intel decide to exploit the idle times resulting from the execution of long instructions by adding intelligence to the BIU.
Code generally executes from a low address to a bigger one. Sure, jumps and calls reverse the monotonically increasing fetch sequence, but even in a short loop more often than not the next instruction byte is located right after the one just fetched. The 8088's BIU used this fact to keep the CPU to memory interface busy (i.e., maximize the bandwidth).
The BIU is really rather stupid. It just blindly keeps fetching the next byte from RAM, storing it in a little FIFO between it and the Execution Unit. When the EU is ready for the next instruction, with luck it is already on-chip in the fast FIFO. A lot of memory fetch delays are thus avoided, greatly increasing the processor's throughput. The FIFO is small, holding only 4 bytes (6 on the 8086), but the typical match between RAM speed, CPU speed, and FIFO size is such that most of the time the next byte is there when needed. Not surprisingly, this process is called prefetching.
Once in a while the CPU will decode and execute some sort of branch operation. Presumably the BIU will have at least partially filled the FIFO with bytes located sequentially beyond the jump; bytes that just are not needed. Jumps and calls flush the FIFO, erasing these unneeded entries. The BIU then starts fetching from the new execution address. Program transfers therefore essentially stall the CPU; the processor must wait for the first instruction at the jump destination before proceeding, just like any simple non-prefetching computer would. Soon, however, the BIU will again fill the FIFO, keeping a bit ahead of the EU's needs.
Some instructions read or write data to the memory array. Non-Harvard machines (like the 8088) share one data bus for both instructions and data, so a load or store operation causes a momentary disruption in normal prefetching sequence. Loads and stores work around the FIFO; they temporarily suspend prefetching, transfer the data, and then resume without corrupting the FIFO's contents.
In general, a deeper FIFO lets the BIU get further ahead of the Execution Unit, maximizing the memory channel bandwidth. It's hard to say just what the most efficient FIFO size is. Different CPUs have different depths; presumably this reflects the design philosophies of the vendors (or perhaps just their ability to cram more transistors on a die).
Some microprocessor go a step further and incorporate a so-called "Pipeline". The pipeline is similar to a prefetcher, in that it blindly and dumbly reads ahead, trying too fill a queue with sequential instructions that might be needed. It offers one
I'll present my Better Firmware Faster seminar in Melbourne and Perth, Australia February 20 and 26th. All are invited. More info here. The early registration discount ends January 20.
RISC processors depend on pipelines to meet their much vaunted one clock per instruction specification. Consider the R3000. This nifty RISC processor includes a 5 stage pipeline, segmented into the following operations:
Fetch the instruction Read operands from registers and decode instruction Do the operation (like ADD, OR, etc) Access memory, if needed Write results back to CPU registers
The R3000's pipeline is a 5 stage affair, each stage corresponding to one of the above states. At any time the CPU is working on 5 instructions at the same time: an instruction is being handled in each stage. After every clock all 5 march ahead to the next stage.
This arrangement permits an effective execution rate of one instruction every clock, but the first one takes 5 clocks to complete. It's rather like the shipyards during World War 2 that cranked out a ship every 48 hours. Each ship took months to build, but many were in various stages of completion.
Prefetchers and pipelines are truly wonderful speed-enhancing features for any computer. Be aware, however, that in some conditions their performance is less than thrilling. Rarely, though, will the addition of either result in code that runs slower than on a conventional fetch-then-execute machine (like the Z80).
Remember that both pipelines and prefetchers (which I'll just call prefetchers here unless a distinction is really needed) flush their contents whenever a program branch occurs. Obviously this happens for all jumps, calls, and returns. Perhaps not so obviously, every interrupt will flush them. Thus, the prefetcher offers no improvement for interrupt latency, a big problem in embedded systems.
Some CPUs have multiple address spaces. Both the 68000 and Z280 include user and system areas that are quite different. Be careful about changing modes (say, from system to user) without executing some sort of jump that will clear the prefetcher. A simple mode transfer might not cause a flush, leaving incorrect bytes behind.
Conditional jumps may behave a bit differently than you are used to. The 64180 doesn't have a prefetcher of any sort, yet if it knows it will not take a conditional jump (it can decide this after reading the first byte), then it will not read the third instruction byte. This frees up some memory bandwidth, and is a nice, intelligent way to deal with conditionals. A prefetching processor will usually read all of the bytes before making the conditional jump decision, so some implicit bandwidth is lost.
Prefetcher stall is a very real problem. All memory bus activity stops as the CPU goes to RAM for the argument of a load or store operation. If the RAM asserts wait states, the entire system, including prefetcher, is shut down for the memory transaction. Where speed is a big problem and prefetcher stalls cannot be tolerated, put frequently used variables in very fast memory.
Debugging with a prefetching processor can be the stuff of horror movies. Don't figure on using a logic analyzer to capture the execution stream. The prefetcher grabs instructions in a stupid, monotonically increasing sequence of addresses. Branches or other program transfers cause this to abruptly halt and restart at the destination address. As a result, some instructions will only be partially fetched, completely confusing the analyzer's disassembler. Worse, an instruction is fetched long before it is executed. If it transfers data the load/store argument will appear in the trace data quite a bit after the instruction proper. How will you align arguments with instructions?
Suppose you use a logic analyzer to trigger a pseudo breakpoint. The analyzer only sees bus transactions, so it will break on the instruction fetch, not its execution. This is simply wrong, since the fetched instruction is often discarded when the prefetcher is flushed after a jump. For example:
loop: do something jnz loop add bx,cx
If you wish to break when the loop completes, a breakpoint on the ADD instruction should do the trick - shouldn't it? Actually, a logic analyzer will trigger the break condition every time the jump executes, since the prefetcher will surely read the ADD. Some tools will handle this properly, but be sure to check with the vendor.
Be wary of especially smart CPUs that have prefetchers - particularly those with Memory Management Units. Suppose the MMU can produce an address violation error if some illegal location is referenced. If code is immediately adjacent to an illegal area, the prefetcher might read ahead into the bad zone and create a false trigger of the violation trap. Even without an MMU this is a problem if your system uses a watchdog circuit that monitors the address bus.
Similarly, be careful about memory mapped I/O when using a prefetcher. Code that runs right next to an I/O port could cause an extraneous prefetch read of the port.
Avoid all of these problems by keeping your code at least one prefetch depth away from the end of a memory protection segment, or from memory mapped I/O addresses.
Finally, be careful about self modifying code. If you really need to use it (and yes, in the embedded world it is sometimes awfully hard to avoid, despite the hysterical pleadings of an army of structured programming advocates) be sure that when you modify the code in memory you don't wind up with an unmodified copy in the prefetch queue. I've been burned by this - it's not easy to find.
Though I've drawn a distinction between the terms "pipeline" and "prefetcher", most designers use them interchangeably. The engineers at Zilog are always quick to correct my use of "prefetcher" to "pipeline" when talking about their Z280. Intel is just as adamant to insist I call their 80x88 implementation a prefetcher. Perhaps I'm just an easy target. Is there a true industry-accepted distinction between the terms? My definitions are not from Webster, but gleaned from studying many vendor manuals.