Pipelines and Prefetchers
All modern processors use pipelines and/or prefetchers to increase
performance. Here's how they work.
Published in Embedded Systems Programming, June 1991
 |
For hints, tricks and ideas about better ways to build embedded systems, subscribe to The Embedded Muse, a free biweekly e-newsletter. No hype, just down to earth embedded talk. 23,000 other engineers subscribe. It takes just a few seconds (all we need is your email address, which is shared with absolutely no one) to subscribe to the Embedded Muse. |
There's more than "crude" in them thar pipelines. While
Congress fiddles with trying to keep oil pipelines full, those
of us building the information economy realize that bandwidth
is a far more valuable commodity, though one that will probably
never generate the same sort of frenzied trading in the pits of
Chicago.
High performance systems are ultimately bound by memory speed.
The economics are compelling: it's pretty easy (where speed is
an issue) to justify one relatively expensive super-fast CPU.
Equally fast memory is generally out of the question, since hundreds
or even thousands of chips make up the storage array.
Usually memory is the high performance system's biggest bottleneck.
Every single instruction must be fetched from the memory every
time it is needed. When the CPU is faster than the RAM chips,
the connection between them is the limiting performance parameter.
RF engineers use the word "bandwidth" to describe the
information capacity of a radio channel. The stink over HDTV is
essentially a bandwidth issue: how can one cram additional picture
information onto the current 6 Mhz TV channel allocation? Bandwidth
is just as big an issue in computer systems, where we transfer
millions of bytes/second over a data bus.
Computer history is filled with innovative and sometimes desperate
attempts to maximize the bandwidth of the CPU-to-memory connection.
Back in the 60's megabuck machines relied on interleaved core.
All even addresses went to one physical memory bank, and the odds
went to another. The operation of each bank could run in parallel,
effectively doubling the bus rate.
Now PCs and workstations rely on huge caches, which are nothing
more than RAMs stuck between the main memory and the CPU. The
cache RAM runs at the same speed the processor does, pushing the
bandwidth limitations further away from the computer. If the next
instruction needed is in cache, the CPU can fetch it without waiting.
As soon as it needs one outside of the cached region the entire
computer effectively comes to a halt as it waits for the relatively
slow memory to cycle.
Conventional (i.e., CISC) processors make very strange demands
on the memory array. When executing short instructions like register
to register ADDs, the CPU is usually much quicker than memory
and sucks instructions as fast as the RAM can supply them. Frequently
the CPU will slow down as it handles a slow instruction. It might
take a number of machine cycles to handle a complex addressing
mode, or perhaps hundreds to do a multiply. During these periods
the bus is idle.
Prefetchers
I believe the 8088 was the first widely available microprocessor
to address this issue. Intel divided the CPU in two. An Execution
Unit contained all of the traditional processor-like features.
The Bus Interface Unit (BIU) sat between the Execution Unit and
the device's pins. If the 8088 were a simple 8080 or 6800, the
BIU would just pass memory requests from the Execution Unit to
the outside world. However, Intel decide to exploit the idle times
resulting from the execution of long instructions by adding intelligence
to the BIU.
Code generally executes from a low address to a bigger one. Sure,
jumps and calls reverse the monotonically increasing fetch sequence,
but even in a short loop more often than not the next instruction
byte is located right after the one just fetched. The 8088's BIU
used this fact to keep the CPU to memory interface busy (i.e.,
maximize the bandwidth).
The BIU is really rather stupid. It just blindly keeps fetching
the next byte from RAM, storing it in a little FIFO between it
and the Execution Unit. When the EU is ready for the next instruction,
with luck it is already on-chip in the fast FIFO. A lot of memory
fetch delays are thus avoided, greatly increasing the processor's
throughput. The FIFO is small, holding only 4 bytes (6 on the
8086), but the typical match between RAM speed, CPU speed, and
FIFO size is such that most of the time the next byte is there
when needed. Not surprisingly, this process is called prefetching.
Once in a while the CPU will decode and execute some sort of branch
operation. Presumably the BIU will have at least partially filled
the FIFO with bytes located sequentially beyond the jump; bytes
that just are not needed. Jumps and calls flush the FIFO, erasing
these unneeded entries. The BIU then starts fetching from the
new execution address. Program transfers therefore essentially
stall the CPU; the processor must wait for the first instruction
at the jump destination before proceeding, just like any simple
non-prefetching computer would. Soon, however, the BIU will again
fill the FIFO, keeping a bit ahead of the EU's needs.
Some instructions read or write data to the memory array. Non-Harvard
machines (like the 8088) share one data bus for both instructions
and data, so a load or store operation causes a momentary disruption
in normal prefetching sequence. Loads and stores work around the
FIFO; they temporarily suspend prefetching, transfer the data,
and then resume without corrupting the FIFO's contents.
In general, a deeper FIFO lets the BIU get further ahead of the
Execution Unit, maximizing the memory channel bandwidth. It's
hard to say just what the most efficient FIFO size is. Different
CPUs have different depths; presumably this reflects the design
philosophies of the vendors (or perhaps just their ability to
cram more transistors on a die).
The Pipeline
Some microprocessor go a step further and incorporate a so-called
"Pipeline". The pipeline is similar to a prefetcher,
in that it blindly and dumbly reads ahead, trying too fill a queue
with sequential instructions that might be needed. It offers one
important speed-enhancing refinement - it partially executes every
queued instruction. Where with a prefetcher the queued instructions
just hang around in a FIFO, the pipeline contains logic that starts
to interpret each prefetched instruction.
RISC processors depend on pipelines to meet their much vaunted
one clock per instruction specification. Consider the R3000. This
nifty RISC processor includes a 5 stage pipeline, segmented into
the following operations:
Fetch the instruction
Read operands from registers and decode instruction
Do the operation (like ADD, OR, etc)
Access memory, if needed
Write results back to CPU registers
The R3000's pipeline is a 5 stage affair, each stage corresponding
to one of the above states. At any time the CPU is working on
5 instructions at the same time: an instruction is being handled
in each stage. After every clock all 5 march ahead to the next
stage.
This arrangement permits an effective execution rate of one instruction
every clock, but the first one takes 5 clocks to complete. It's
rather like the shipyards during World War 2 that cranked out
a ship every 48 hours. Each ship took months to build, but many
were in various stages of completion.
The Downside
Prefetchers and pipelines are truly wonderful speed-enhancing
features for any computer. Be aware, however, that in some conditions
their performance is less than thrilling. Rarely, though, will
the addition of either result in code that runs slower than on
a conventional fetch-then-execute machine (like the Z80).
Remember that both pipelines and prefetchers (which I'll just
call prefetchers here unless a distinction is really needed) flush
their contents whenever a program branch occurs. Obviously this
happens for all jumps, calls, and returns. Perhaps not so obviously,
every interrupt will flush them. Thus, the prefetcher offers no
improvement for interrupt latency, a big problem in embedded systems.
Some CPUs have multiple address spaces. Both the 68000 and Z280
include user and system areas that are quite different. Be careful
about changing modes (say, from system to user) without executing
some sort of jump that will clear the prefetcher. A simple mode
transfer might not cause a flush, leaving incorrect bytes behind.
Conditional jumps may behave a bit differently than you are used
to. The 64180 doesn't have a prefetcher of any sort, yet if it
knows it will not take a conditional jump (it can decide this
after reading the first byte), then it will not read the third
instruction byte. This frees up some memory bandwidth, and is
a nice, intelligent way to deal with conditionals. A prefetching
processor will usually read all of the bytes before making the
conditional jump decision, so some implicit bandwidth is lost.
Prefetcher stall is a very real problem. All memory bus activity
stops as the CPU goes to RAM for the argument of a load or store
operation. If the RAM asserts wait states, the entire system,
including prefetcher, is shut down for the memory transaction.
Where speed is a big problem and prefetcher stalls cannot be tolerated,
put frequently used variables in very fast memory.
Debugging with a prefetching processor can be the stuff of horror
movies. Don't figure on using a logic analyzer to capture the
execution stream. The prefetcher grabs instructions in a stupid,
monotonically increasing sequence of addresses. Branches or other
program transfers cause this to abruptly halt and restart at the
destination address. As a result, some instructions will only
be partially fetched, completely confusing the analyzer's disassembler.
Worse, an instruction is fetched long before it is executed. If
it transfers data the load/store argument will appear in the trace
data quite a bit after the instruction proper. How will you align
arguments with instructions?
Suppose you use a logic analyzer to trigger a pseudo breakpoint.
The analyzer only sees bus transactions, so it will break on the
instruction fetch, not its execution. This is simply wrong, since
the fetched instruction is often discarded when the prefetcher
is flushed after a jump. For example:
loop: do something
jnz loop
add bx,cx
If you wish to break when the loop completes, a breakpoint on
the ADD instruction should do the trick - shouldn't it? Actually,
a logic analyzer will trigger the break condition every time the
jump executes, since the prefetcher will surely read the ADD.
Some tools will handle this properly, but be sure to check with
the vendor.
Other Problems
Be wary of especially smart CPUs that have prefetchers - particularly
those with Memory Management Units. Suppose the MMU can produce
an address violation error if some illegal location is referenced.
If code is immediately adjacent to an illegal area, the prefetcher
might read ahead into the bad zone and create a false trigger
of the violation trap. Even without an MMU this is a problem if
your system uses a watchdog circuit that monitors the address
bus.
Similarly, be careful about memory mapped I/O when using a prefetcher.
Code that runs right next to an I/O port could cause an extraneous
prefetch read of the port.
Avoid all of these problems by keeping your code at least one
prefetch depth away from the end of a memory protection segment,
or from memory mapped I/O addresses.
Finally, be careful about self modifying code. If you really need
to use it (and yes, in the embedded world it is sometimes awfully
hard to avoid, despite the hysterical pleadings of an army of
structured programming advocates) be sure that when you modify
the code in memory you don't wind up with an unmodified copy in
the prefetch queue. I've been burned by this - it's not easy to
find.
Conclusion
Though I've drawn a distinction between the terms "pipeline"
and "prefetcher", most designers use them interchangeably.
The engineers at Zilog are always quick to correct my use of "prefetcher"
to "pipeline" when talking about their Z280. Intel is
just as adamant to insist I call their 80x88 implementation a
prefetcher. Perhaps I'm just an easy target. Is there a true industry-accepted
distinction between the terms? My definitions are not from Webster,
but gleaned from studying many vendor manuals.
|