Software folks need to understand how a microprocessor handles data on its busses. Here's the short intro.
Published in ESP, March 1995.
|For novel ideas about building embedded systems (both hardware and firmware), join the 27,000+ engineers who subscribe to The Embedded Muse, a free biweekly newsletter. The Muse has no hype, no vendor PR. It takes just a few seconds (just enter your email, which is shared with absolutely no one) to subscribe.|
By Jack Ganssle
My October column about DMA brought a number of interesting replies. Quite a few readers commented that they just did not have a good idea "what all this bus cycle stuff is about". After all, this is a software magazine; it makes sense that hardware details, while critical to the embedded world, are a bit fuzzy to many readers.
How many engineers understand, in detail, computer operation? Over the years I've asked each technical person I've hired for a blow-by-blow description of how a microprocessor fetches, decodes, and executes instructions. What does the instruction pointer do? How are JMPs decoded and processed? Surprisingly few can give an accurate, detailed explanation. Engineers are taught to treat complex components like microprocessors as block diagrams. Connect the right peripherals according to the cookbook and you'll never need to understand the block's internal operations.
Though modern society is built on this concept of increasing abstraction, I maintain that some sort of understanding of the operation of fundamental devices (like computers) is essential to efficient solving of complex problems. Knowledge - of anything - makes life more interesting. Unfortunately a thorough understanding computer operation is not terribly likely to enhance your cocktail party repartee: "exactly how much setup time does your system need?", she asked, breathlessly, a slight tinge of pink coloring her cheeks. The party's hum faded into the background; my vision tunneled into sharp focus on her... Uh, wrong magazine. Suffice to say, inquiring minds want to know.
Even the purest software type working in the embedded industry will hear no end of discussion about bus cycles, wait states, read and writes, and the like when waiting for the hardware weenies to put yet another Band-Aid on the malfunctioning prototype hardware. Here's the 10 minute introduction to bus cycles, so you can at least look knowledgeable.
Everyone bandies the word "clock" around. It's common knowledge that the faster your clock runs the more information processing gets done per unit time. How does the clock rate relate to processor speed, and why is one even needed?
A science wag half-facetiously commented that time is what keeps everything from happening all at once. The same reasoning applies to a computer system. The clock signal, which is produced by a simple oscillator on the computer board, sequences processor operations to give the circuits time to complete one operation before proceeding to the next. Logic takes a while to do anything: RAM may need 50-100 nanoseconds to retrieve a byte, an adder might require 10 or more nanoseconds to compute a result. The system clock insures each operation finishes before the next takes place.
Time is related to clock frequency by: time=1/frequency. A 10 MHz oscillator gives 100 nanoseconds per clock cycle. Just to confuse things, many processors divide the input clock frequency, most of the time by 2. Your 10 MHz oscillator going into a 80186 actually creates a 5 MHz, or 200 nsec, cycle time.
Each clock period (the time required for one complete cycle) is called a "T state", and is the basic unit of processor timing. Nothing completes in less than a T state, though propagation times through individual components will generally faster than the T state time. Designers select a T state interval greater than the sum of all propagation delays in the memory and I/O paths. (We will see that memories are notoriously slow; you can inject wait states to add T states to read or write cycles to avoid the use of very expensive fast devices.)
During a single T state the processor can do very little - perhaps output the address of the next memory location needed. An entire memory read or write cycle requires 2 or more T states, depending on the processor selected. A "machine cycle" is the entire time - 2, 3, or more T states - required to perform a single read or write.
(RISC systems are generally single-T state machines. They complete an entire instruction in one clock cycle, generally by overlapping several operations at once.)
The venerable Z80 uses 4 T states per machine cycle. Run it at 10 MHz (100 nsec) and you'll need 400 nsec per machine cycle. Zilog's Z180 is an improved Z80 that, among other things, is considerably speedier. Most of the performance gain comes from both a faster clock and a reduction in the number of T states from 4 to 3.
The 386 is a two T state machine. If it's instruction set and bus width were the same as a Z80 (dear Intel - please forgive me!), it would run at twice the speed of the 4 T state Z80 at the same clock rate.
We've talked about T states and Machine cycles - how long does an instruction take?
Most instructions are composed of the following pieces: a "fetch" process, instruction decoding, perhaps some more fetching, and execution. Let's look at a few typical operations to see how these work, using the Z80 for simplicity's sake.
A NOP is opcode 00. Despite the obvious fact that this is a "do nothing" command, the processor must at the very least read the instruction from memory (the "fetch" cycle), decode it, and then very cleverly execute it ("do nothing").
The NOP executes in one machine cycle. Essentially all of the cycle is devoted to fetching the 00 from memory. As I mentioned, this requires 4 T states on a Z80. After completing the fetch the processor very quickly decodes the instruction, realizes that there's nothing to do, and then starts another machine cycle to get the next instruction.
Now consider a JMP 0100. An absolute jump to a 16 bit destination address, regardless of processor, needs at least (and in this case exactly) a three byte instruction: the opcode itself (C3 on the Z80), and two bytes of destination address.
On the Z80 three machine cycles handle the jump. The first is a fetch that reads the C3 opcode. The CPU quickly (near the end of this first cycle) realizes that C3 implies a jump requiring a two byte operand. It therefore issues two back-to-back reads - each an entire machine cycle - to bring in the destination address. Only then can the Z80 load its instruction pointer with 0100 and start sucking in code from the new address. A simple JMP takes 12 T-states: three machine cycles at 4 t states per. With a 10 MHz clock, we're looking at 1.2 microseconds execution time.
Here's where you find the first benefit of 16 or 32 bit computers: the wider bus reduces the number of fetches needed to read long instructions. Each machine cycle takes time... a lot of time. Any time you can eliminate these by using smaller, smarter instructions or a wider bus you'll get substantial performance improvements.
Now consider LD A,(HL). This one byte opcode tells the Z80 to read from memory at the address contained in register pair HL. The single byte opcode takes but a single machine cycle to fetch. Once read and decoded, though, the CPU must put the contents of HL on the address bus and read whatever is found there into register A - requiring another machine cycle.
Taking this one step further, execute a POP HL. Again, this single byte opcode needs only one fetch cycle. After decoding the meaning of the byte, the Z80 realizes that two bytes (16 bits) are required from the stack. It starts a second read cycle, now with the stack pointer providing the address. A third then commences, this time at address SP+1. Here, a single byte opcode needed three machine states. Of course, if you were clever enough to use a 16 bit processor the entire operation could complete in two: a fetch, and a word-wide read at the SP address.
Here's a case where a 32 bit processor brings no implicit advantage. It still needs two cycles, even though we're only transferring 24 bits (8 bits of opcode and 16 of stack data), because the opcode is at one address and the stack is (hopefully!) somewhere else. The CPU can only issue a single address - a single memory operation - at a time. Of course, Harvard architecture machines, like most DSPs, have separate data and instruction busses, and can run simultaneous transfers. There's a corresponding performance improvement.
In these examples the instruction execution time was buried in the read and fetch cycles. A JMP, POP, ADD, and most other operations are quite simple. Others are not. The Z180 includes integer multiply and divides, which use the "shift and add/subtract" algorithm. Execution time is a function of the operands supplied. A single machine cycle can take many, many clocks as the bus lies idle (nothing to transfer between the processor and memory), but as the CPU whirs along, thinking very hard.
Here's where adding transistors to a device improves performance. Use a barrel shifter (a sort of parallel shifter that works in a single clock cycle), and the multiply times approach zero. Bus width reduces machine cycles by doing more at once; transistors shorten long machine cycles by completing complex operations faster.
Fetch, Read or Write?
Though some CPUs support oddball machine cycles (like DRAM refresh), virtually every cycle is either a Fetch, Read, or Write. Fetches read instructions - always from memory. Read and write cycles transfer operand data, from memory or I/O.
Fetch and memory read cycles intuitively feel the same. Both read bytes from RAM or ROM. The difference is subtle. Many processors have a "fetch" signal that differentiates the two. Sometimes, as on the Z80, the exact timing may differ a bit between the two. As the industry evolves, though, the difference in timing and signals is disappearing. Often it's all but impossible to tell what the CPU is doing by watching the bus, unless you notice that fetches generally are from increasing addresses (programs execute from low addresses to higher ones, unless there's a program transfer), while memory reads occur much less frequently, generally from addresses not near the code. The hardware doesn't care or need to know what's going on. An address comes out, the CPU asserts the read signal, and the selected memory device transfers data back.
In the preceding discussion we've looked at common instructions, and have found that each one is nothing more than a sequence of machine cycles. If we ignore interrupts, refresh, and other infrequent intruders, there are only two basic kinds of machine cycles: reads and writes. Let's look at what goes on during a cycle.
The figure shows how a Z180 read cycle. This is a typical timing diagram of the sort hardware folks sweat over, representing how the signals on the computer bus change over time. If you connected the signals shown to a logic analyzer you'd see just this sort of display.
The top signal, clock, provides the basic timing reference to the system. Each cycle is one T-state, as indicated by the "T1", "T2", and "T3" designations.
Shortly after the cycle begins the processor provides an address. I've represented the 16 address lines by one bus; if it wanted to read from 0100, then A15 to A0 are all zeroes except for A8. At about the same time the CPU also asserts its Memory Read signal.
The processor is telling the memory array to return a byte from address 100. The CPU drives address and Memory Read; now it wants memory to drive data back to it on the data bus. You'll notice that the addresses remain valid during the entire time Memory Read is asserted. These stable signals go into a ROM, say, giving the ROM time to pull data from the selected address and put it on the bus. Remember, memories are slow.
Sooner or later the ROM data will be valid. The processor specification tells the system designer how much time is allowed before valid data must appear. The timing diagram shows that the ROM better respond a little bit before Read goes away. This is called the minimum "setup" time.
When the processor removes Memory Read the cycle is almost over. Another specification, "hold time", specifies how long the data from the ROM should remain on the processor's data bus after Read disappears.
Setup and hold times are truly critical. Violate the minimums, and your system will erratically crash.
But... that's it. What could be simpler?
Now, getting the timing just right can be tedious, since good design implies using memories that are fast enough, but not too fast (speed is expensive). High speed clocks cause all sorts of trouble in insuring the setup and hold times are met. I don't want to denigrate the problems faced by a hardware designer! It is important, though, to realize that the basics of timing are really quite simple. Ah, just as the basics of programming are no great mystery. Well, OK, maybe anyone with less than a genius IQ will be able to figure out your code, but...
Write and I/O cycles are very similar. The timing might shift a little, but the concepts are the same. The biggest difference is that during a Write cycle the processor drives address, Memory Write, and the data lines to memory or I/O.
Suppose the ROMs are too slow? Since you can't speed up a slow memory chip you have to slow down the computer. A wait state stretches the time during which Read or Write is asserted, giving the device more time to decode an address. Each processor has a wait state input, and associated specification for driving this line. If you assert it by a particular time in the cycle, the processor will generate additional T-states while keeping the address and read or write valid. In effect, using the example in the figure, you'll get extra T2 states for as long as the Wait input is asserted.
The penalty for using a wait state depends on the processor. One wait state on a Z80 stretches the machine cycle from 4 to 5 T states - not a tremendous change. Add one wait to a two T state machine like the 386, and you've suffered a 50% performance penalty. Ugh!
Since we use waits simply as a way to save money by using cheap, slow memories, you can avoid this performance hit by using cache RAM. Cache is a smallish chunk of very fast (read, expensive) RAM that runs with zero wait states. Given that computers often run in small loops, smart hardware can track these loops, keeping the most recently-used parts of the code in very fast cache. Any access outside of the cache will incur a wait state, but, if cleverly implemented, better than 90% "hit" rates are not uncommon.
Your 33 MHz 486 most likely has a quarter meg or so of 25 nsec cache (very costly RAM), yet lives very happily with many Mb of cheap 70 nsec DRAMs running with one or two wait states. We'd all like 32 Mb of fast RAM; not many of us can afford it.
Software folks are well aware that very simple instructions, executed at a mind-boggling rate, give the complex actions of a sophisticated program. The hardware is no different. Simple T states build machine cycles which result in instruction execution. Each component is trivial, but when repeated millions of times per second makes a computer the wonderful widget it is.
I leave you with one thought: when in the throes of fighting some nasty, intransigent bug, when the CPU seems to have a malicious mind of its own, remember it's only a very simple machine. You are smarter than it is.