Bus Cycles
Software folks need to understand how a microprocessor handles
data on its busses. Here's the short intro.
Published in ESP, March 1995.
My October column about DMA brought a number of interesting replies.
Quite a few readers commented that they just did not have a good
idea of "what all this bus cycle stuff is about". After
all, this is a software magazine; it makes sense that hardware
details, while critical to the embedded world, are a bit fuzzy
to many readers.
How many engineers understand, in detail, computer operation?
Over the years I've asked each technical person I've hired for
a blow-by-blow description of how a microprocessor fetches, decodes,
and executes instructions. What does the instruction pointer do?
How are JMPs decoded and processed? Surprisingly few can give
an accurate, detailed explanation. Engineers are taught to treat
complex components like microprocessors as block diagrams. Connect
the right peripherals according to the cookbook and you'll never
need to understand the block's internal operations.
Though modern society is built on this concept of increasing abstraction,
I maintain that some sort of understanding of the operation of
fundamental devices (like computers) is essential to efficient
solving of complex problems. Knowledge - of anything - makes life
more interesting. Unfortunately, a thorough understanding of computer
operation is not terribly likely to enhance your cocktail party
repartee: "exactly how much setup time does your system need?",
she asked, breathlessly, a slight tinge of pink coloring her cheeks.
The party's hum faded into the background; my vision tunneled
into sharp focus on her... Uh, wrong magazine. Suffice to say,
inquiring minds want to know.
Even the purest software type working in the embedded industry
will hear no end of discussion about bus cycles, wait states,
reads and writes, and the like when waiting for the hardware weenies
to put yet another Band-Aid on the malfunctioning prototype hardware.
Here's the 10 minute introduction to bus cycles, so you can at
least look knowledgeable.
Tic Toc
Everyone bandies the word "clock" around. It's common
knowledge that the faster your clock runs the more information
processing gets done per unit time. How does the clock rate relate
to processor speed, and why is one even needed?
A science wag half-facetiously commented that time is what keeps
everything from happening all at once. The same reasoning applies
to a computer system. The clock signal, which is produced by a
simple oscillator on the computer board, sequences processor operations
to give the circuits time to complete one operation before proceeding
to the next. Logic takes a while to do anything: RAM may need
50-100 nanoseconds to retrieve a byte, an adder might require
10 or more nanoseconds to compute a result. The system clock ensures
each operation finishes before the next takes place.
Time is related to clock frequency by: time=1/frequency. A 10
MHz oscillator gives 100 nanoseconds per clock cycle. Just to
confuse things, many processors divide the input clock frequency,
most of the time by 2. Your 10 MHz oscillator going into an 80186
actually creates a 5 MHz, or 200 nsec, cycle time.
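As a quick check on that arithmetic, here is a minimal Python sketch.
The function name and the divider argument are mine, invented for
illustration; they do not come from any datasheet.

```python
def t_state_ns(oscillator_mhz, divider=1):
    """Return the clock period (one T state) in nanoseconds.

    divider is a hypothetical parameter modeling a CPU that divides
    its input clock internally before using it.
    """
    cpu_clock_mhz = oscillator_mhz / divider
    return 1000.0 / cpu_clock_mhz   # 1/MHz is microseconds; x1000 gives ns


print(t_state_ns(10))      # 10 MHz used directly      -> 100.0
print(t_state_ns(10, 2))   # 10 MHz into a divide-by-2 -> 200.0
```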
Each clock period (the time required for one complete cycle) is
called a "T state", and is the basic unit of processor
timing. Nothing completes in less than a T state, though
propagation times through individual components will generally
be shorter than the T state time. Designers select a T state interval
greater than the sum of all propagation delays in the memory and
I/O paths. (We will see that memories are notoriously slow; you
can inject wait states to add T states to read or write cycles
to avoid the use of very expensive fast devices.)
During a single T state the processor can do very little - perhaps
output the address of the next memory location needed. An entire
memory read or write cycle requires 2 or more T states, depending
on the processor selected. A "machine cycle" is the
entire time - 2, 3, or more T states - required to perform a single
read or write.
(RISC systems are generally single-T state machines. They complete
an entire instruction in one clock cycle, generally by overlapping
several operations at once.)
The venerable Z80 uses 4 T states per machine cycle. Run it at
10 MHz (100 nsec) and you'll need 400 nsec per machine cycle.
Zilog's Z180 is an improved Z80 that, among other things, is considerably
speedier. Most of the performance gain comes from both a faster
clock and a reduction in the number of T states from 4 to 3.
The 386 is a two T state machine. If its instruction set and
bus width were the same as a Z80 (dear Intel - please forgive
me!), it would run at twice the speed of the 4 T state Z80 at
the same clock rate.
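The comparison is just multiplication, and a rough sketch makes the
ratios easy to see. The function name is hypothetical, and the T
state counts are the simplified ones used in this column (4 for the
Z80, 3 for the Z180, 2 for a two T state machine like the 386).

```python
def machine_cycle_ns(clock_mhz, t_states_per_cycle):
    """Length of one machine cycle in nanoseconds."""
    t_state = 1000.0 / clock_mhz          # one clock period in ns
    return t_states_per_cycle * t_state


# Same 10 MHz clock, different numbers of T states per machine cycle.
for name, t_states in (("Z80", 4), ("Z180", 3), ("two T state CPU", 2)):
    print(name, machine_cycle_ns(10, t_states), "ns")
```

At the same clock rate the two T state part finishes a machine cycle
in half the time the four T state part needs.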
Instructions
We've talked about T states and machine cycles - how long does
an instruction take?
Most instructions are composed of the following pieces: a "fetch"
process, instruction decoding, perhaps some more fetching, and
execution. Let's look at a few typical operations to see how these
work, using the Z80 for simplicity's sake.
A NOP is opcode 00. Despite the obvious fact that this is a "do
nothing" command, the processor must at the very least read
the instruction from memory (the "fetch" cycle), decode
it, and then very cleverly execute it ("do nothing").
The NOP executes in one machine cycle. Essentially all of the
cycle is devoted to fetching the 00 from memory. As I mentioned,
this requires 4 T states on a Z80. After completing the fetch
the processor very quickly decodes the instruction, realizes that
there's nothing to do, and then starts another machine cycle to
get the next instruction.
Now consider a JMP 0100. An absolute jump to a 16 bit destination
address, regardless of processor, needs at least (and in this
case exactly) a three byte instruction: the opcode itself (C3
on the Z80), and two bytes of destination address.
On the Z80 three machine cycles handle the jump. The first is
a fetch that reads the C3 opcode. The CPU quickly (near the end
of this first cycle) realizes that C3 implies a jump requiring
a two byte operand. It therefore issues two back-to-back reads
- each an entire machine cycle - to bring in the destination address.
Only then can the Z80 load its instruction pointer with 0100 and
start sucking in code from the new address. A simple JMP takes
12 T states: three machine cycles at 4 T states per. With a 10
MHz clock, we're looking at 1.2 microseconds execution time.
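The same sum can be written as a tiny calculator. This is only a
sketch of the simplified model used here, where every machine cycle
costs four T states; the names are mine, and a real Z80's cycles are
not all the same length.

```python
T_STATES_PER_MACHINE_CYCLE = 4   # simplified Z80 figure from the text


def instruction_time_us(machine_cycles, clock_mhz):
    """Execution time in microseconds for a given number of machine cycles."""
    t_states = machine_cycles * T_STATES_PER_MACHINE_CYCLE
    return t_states / clock_mhz   # a clock of N MHz gives N T states per microsecond


print(instruction_time_us(1, 10))   # NOP: one fetch            -> 0.4
print(instruction_time_us(3, 10))   # JMP 0100: fetch + 2 reads -> 1.2
```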
Here's where you find the first benefit of 16 or 32 bit computers:
the wider bus reduces the number of fetches needed to read long
instructions. Each machine cycle takes time... a lot of time.
Any time you can eliminate these by using smaller, smarter instructions
or a wider bus you'll get substantial performance improvements.
Now consider LD A,(HL). This one byte opcode tells the Z80 to
read from memory at the address contained in register pair HL.
The single byte opcode takes but a single machine cycle to fetch.
Once read and decoded, though, the CPU must put the contents of
HL on the address bus and read whatever is found there into register
A - requiring another machine cycle.
Taking this one step further, execute a POP HL. Again, this single
byte opcode needs only one fetch cycle. After decoding the meaning
of the byte, the Z80 realizes that two bytes (16 bits) are required
from the stack. It starts a second read cycle, now with the stack
pointer providing the address. A third then commences, this time
at address SP+1. Here, a single byte opcode needed three machine
cycles. Of course, if you were clever enough to use a 16 bit
processor the entire operation could complete in two: a fetch,
and a word-wide read at the SP address.
Here's a case where a 32 bit processor brings no implicit advantage.
It still needs two cycles, even though we're only transferring
24 bits (8 bits of opcode and 16 of stack data), because the opcode
is at one address and the stack is (hopefully!) somewhere else.
The CPU can only issue a single address - a single memory operation
- at a time. Of course, Harvard architecture machines, like most
DSPs, have separate data and instruction busses, and can run simultaneous
transfers. There's a corresponding performance improvement.
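To make the cycle counting concrete, here is a toy model that lists
the bus cycles for the four instructions discussed. It is a sketch
under loud assumptions: the function and its arguments are invented
for illustration, the bus width only changes the POP HL case, and the
stack is assumed to be word aligned.

```python
def bus_cycles(instruction, pc, hl=0, sp=0, bus_bits=8):
    """Return the list of (kind, address) bus cycles for one instruction."""
    cycles = [("fetch", pc)]             # every instruction starts with an opcode fetch
    if instruction == "NOP":
        pass                             # nothing more to transfer
    elif instruction == "JMP":
        # two operand bytes follow the opcode in memory
        cycles += [("read", pc + 1), ("read", pc + 2)]
    elif instruction == "LD A,(HL)":
        cycles.append(("read", hl))      # HL supplies the address
    elif instruction == "POP HL":
        if bus_bits >= 16:
            cycles.append(("read", sp))  # one word-wide read
        else:
            cycles += [("read", sp), ("read", sp + 1)]
    else:
        raise ValueError("instruction not modeled: " + instruction)
    return cycles


print(len(bus_cycles("POP HL", 0x0200, sp=0x8000)))               # 8 bit bus
print(len(bus_cycles("POP HL", 0x0200, sp=0x8000, bus_bits=16)))  # 16 bit bus
print(len(bus_cycles("POP HL", 0x0200, sp=0x8000, bus_bits=32)))  # 32 bit bus
```

The 16 and 32 bit cases come out the same, because the opcode and
the stack data live at different addresses.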
In these examples the instruction execution time was buried in
the read and fetch cycles. A JMP, POP, ADD, and most other operations
are quite simple. Others are not. The Z180 includes integer multiply
and divides, which use the "shift and add/subtract"
algorithm. Execution time is a function of the operands supplied.
A single machine cycle can take many, many clocks as the bus lies
idle (nothing to transfer between the processor and memory), but
as the CPU whirs along, thinking very hard.
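For the curious, shift and add multiplication looks roughly like
this. It is a generic textbook version written for illustration, not
a description of any particular processor's internals; the count of
additions is only a stand-in for "work that depends on the operands".

```python
def shift_add_multiply(a, b):
    """Multiply two non-negative integers.

    Returns (product, additions), where additions counts how many
    times the shifted multiplicand was added in.
    """
    product = 0
    additions = 0
    while b:
        if b & 1:          # low bit of the multiplier set: add
            product += a
            additions += 1
        a <<= 1            # multiplicand moves up one bit position
        b >>= 1            # examine the next multiplier bit
    return product, additions


print(shift_add_multiply(200, 1))     # -> (200, 1)
print(shift_add_multiply(200, 255))   # -> (51000, 8)
```

Same multiplicand, very different amounts of work, all of it done
inside the CPU with nothing happening on the bus.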
Here's where adding transistors to a device improves performance.
Use a barrel shifter (a sort of parallel shifter that works in
a single clock cycle), and the multiply times approach zero. Bus
width reduces machine cycles by doing more at once; transistors
shorten long machine cycles by completing complex operations faster.
Fetch, Read or Write?
Though some CPUs support oddball machine cycles (like DRAM refresh),
virtually every cycle is either a Fetch, Read, or Write. Fetches
read instructions - always from memory. Read and write cycles
transfer operand data, from memory or I/O.
Fetch and memory read cycles intuitively feel the same. Both read
bytes from RAM or ROM. The difference is subtle. Many processors
have a "fetch" signal that differentiates the two. Sometimes,
as on the Z80, the exact timing may differ a bit between the two.
As the industry evolves, though, the difference in timing and
signals is disappearing. Often it's all but impossible to tell
what the CPU is doing by watching the bus, unless you notice that
fetches generally are from increasing addresses (programs execute
from low addresses to higher ones, unless there's a program transfer),
while memory reads occur much less frequently, generally from
addresses not near the code. The hardware doesn't care or need
to know what's going on. An address comes out, the CPU asserts
the read signal, and the selected memory device transfers data
back.
In the preceding discussion we've looked at common instructions,
and have found that each one is nothing more than a sequence of
machine cycles. If we ignore interrupts, refresh, and other infrequent
intruders, there are only two basic kinds of machine cycles: reads
and writes. Let's look at what goes on during a cycle.
The figure shows a Z180 read cycle. This is a typical timing
diagram of the sort hardware folks sweat over, representing how
the signals on the computer bus change over time. If you connected
the signals shown to a logic analyzer you'd see just this sort
of display.
The top signal, clock, provides the basic timing reference to
the system. Each cycle is one T-state, as indicated by the "T1",
"T2", and "T3" designations.
Shortly after the cycle begins the processor provides an address.
I've represented the 16 address lines by one bus; if it wanted
to read from 0100, then A15 to A0 are all zeroes except for A8.
At about the same time the CPU also asserts its Memory Read signal.
The processor is telling the memory array to return a byte from
address 0100. The CPU drives address and Memory Read; now it wants
memory to drive data back to it on the data bus. You'll notice
that the addresses remain valid during the entire time Memory
Read is asserted. These stable signals go into a ROM, say, giving
the ROM time to pull data from the selected address and put it
on the bus. Remember, memories are slow.
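If the mapping from an address to individual address lines is not
obvious, this small sketch spells it out. The helper name is made up
for illustration; it simply reports which lines carry a one.

```python
def asserted_address_lines(address, width=16):
    """Names of the address lines that are high for a given address."""
    return [f"A{bit}" for bit in range(width) if address & (1 << bit)]


print(asserted_address_lines(0x0100))   # -> ['A8']
print(asserted_address_lines(0x0101))   # -> ['A0', 'A8']
```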
Sooner or later the ROM data will be valid. The processor specification
tells the system designer how much time is allowed before valid
data must appear. The timing diagram shows that the ROM
had better respond a little bit before Read goes away. This is called
the minimum "setup" time.
When the processor removes Memory Read the cycle is almost over.
Another specification, "hold time", specifies how long
the data from the ROM should remain on the processor's data bus
after Read disappears.
Setup and hold times are truly critical. Violate the minimums,
and your system will erratically crash.
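A back-of-the-envelope setup check might look like the sketch below.
Everything in it is an assumption made for illustration: the function
name, the idea that the CPU samples data at the end of the last T
state, and all of the numbers. A real design takes its reference
edges and limits from the processor and memory data sheets.

```python
def read_timing_margin_ns(t_state_ns, t_states, addr_delay_ns,
                          memory_access_ns, setup_ns):
    """Slack in ns between data becoming valid and the latest allowed moment.

    Assumes (hypothetically) that data is sampled at the end of the
    final T state. A negative result means the setup time is violated.
    """
    sample_time = t_states * t_state_ns               # when the CPU latches data
    data_valid = addr_delay_ns + memory_access_ns     # when the memory delivers it
    return sample_time - setup_ns - data_valid


# Invented numbers: 100 ns T states, 3 T state read, address valid
# 30 ns into the cycle, 25 ns of required setup time.
print(read_timing_margin_ns(100, 3, 30, 200, 25))   # -> 45 (works)
print(read_timing_margin_ns(100, 3, 30, 250, 25))   # -> -5 (too slow)
```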
But... that's it. What could be simpler?
Now, getting the timing just right can be tedious, since good
design implies using memories that are fast enough, but not too
fast (speed is expensive). High speed clocks cause all sorts of
trouble in ensuring the setup and hold times are met. I don't
want to denigrate the problems faced by a hardware designer! It
is important, though, to realize that the basics of timing are
really quite simple. Ah, just as the basics of programming are
no great mystery. Well, OK, maybe no one with less than a genius
IQ will be able to figure out your code, but...
Write and I/O cycles are very similar. The timing might shift
a little, but the concepts are the same. The biggest difference
is that during a Write cycle the processor drives address, Memory
Write, and the data lines to memory or I/O.
Suppose the ROMs are too slow? Since you can't speed up a slow
memory chip you have to slow down the computer. A wait state
stretches the time during which Read or Write is asserted, giving
the device more time to decode an address. Each processor has
a wait state input, and associated specification for driving this
line. If you assert it by a particular time in the cycle, the
processor will generate additional T-states while keeping the
address and read or write valid. In effect, using the example
in the figure, you'll get extra T2 states for as long as the Wait
input is asserted.
The penalty for using a wait state depends on the processor. One
wait state on a Z80 stretches the machine cycle from 4 to 5 T
states - not a tremendous change. Add one wait to a two T state
machine like the 386, and every machine cycle takes 50% longer.
Ugh!
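The stretch is easy to work out for any combination. A minimal
sketch, with a function name of my own choosing:

```python
def wait_state_stretch(base_t_states, wait_states):
    """Fractional increase in machine cycle length caused by wait states."""
    return wait_states / base_t_states


print(wait_state_stretch(4, 1))   # 4 T state machine, one wait -> 0.25
print(wait_state_stretch(2, 1))   # 2 T state machine, one wait -> 0.5
```

The fewer T states a machine cycle has to begin with, the more each
added wait state hurts.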
Since we use waits simply as a way to save money by using cheap,
slow memories, you can avoid this performance hit by using cache
RAM. Cache is a smallish chunk of very fast (read, expensive)
RAM that runs with zero wait states. Given that computers often
run in small loops, smart hardware can track these loops, keeping
the most recently-used parts of the code in very fast cache. Any
access outside of the cache will incur a wait state, but, if cleverly
implemented, better than 90% "hit" rates are not uncommon.
Your 33 MHz 486 most likely has a quarter meg or so of 25 nsec
cache (very costly RAM), yet lives very happily with many MB of
cheap 70 nsec DRAMs running with one or two wait states. We'd
all like 32 MB of fast RAM; not many of us can afford it.
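To see why a good hit rate hides most of the wait states, consider
this rough average. It is a sketch that assumes every cache hit runs
with zero waits and every miss pays the full wait penalty; the
function name and the sample figures are mine.

```python
def average_t_states(base_t_states, wait_states, hit_rate):
    """Average T states per machine cycle for a given cache hit rate."""
    miss_rate = 1.0 - hit_rate
    return base_t_states + miss_rate * wait_states


# Two T state machine, two wait states on every miss.
for hit_rate in (0.0, 0.9, 1.0):
    print(f"{hit_rate:.0%} hits: {average_t_states(2, 2, hit_rate):.2f} T states")
```

With 90% hits the average cycle is only about a tenth longer than
the zero wait state case, instead of twice as long.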
Conclusion
Software folks are well aware that very simple instructions, executed
at a mind-boggling rate, give the complex actions of a sophisticated
program. The hardware is no different. Simple T states build machine
cycles which result in instruction execution. Each component is
trivial, but when repeated millions of times per second makes
a computer the wonderful widget it is.
I leave you with one thought: when in the throes of fighting some
nasty, intransigent bug, when the CPU seems to have a malicious
mind of its own, remember it's only a very simple machine. You
are smarter than it is.