DMA is an important part of many embedded systems, yet far too
many of us don't really understand it. Read on...
Published in Embedded Systems Programming, October 1994
Engineers love to torture the English language, using words in
convoluted ways, verbizing the most passive of nouns, and inventing
strange new words that might embarrass your mother (the word "dikes",
referring to diagonal cutters - wire cutters - comes to mind).
Acronyms are our special bane. Every three word noun phrase is
immediately shortened to its initials, even when this may make
an acronym of an acronym. The passage of time dulls one's memory
till the original words referred to by the letters slip away,
so the acronym becomes its own word. For example, though "CRT"
really refers only to a large tube, it has come to stand for a
complete monitor, electronics, tube, and all. Even names of corporations
reflect our reliance on verbal shorthand. IBM, GE, SAIC - an alphabet
soup of letters dances in front of our eyes. CACI even reincorporated
themselves some years back to make the acronym the new, real corporate
name, consigning the words the letters stood for to eternal obscurity.
Many embedded systems make use of DMA controllers. How many of
us remember that DMA stands for Direct Memory Access? What idiot
invented this meaningless phrase? I figure any CPU cycle directly
accesses memory. This is a case of an acronym conveniently sweeping
an embarrassing piece of verbal pomposity under the rug where
it belongs.
Regardless, DMA is nothing more than a way to bypass the CPU to
get to system memory and/or I/O. DMA is usually associated with
an I/O device that needs very rapid access to large chunks of
RAM. For example - a data logger may need to save a massive burst
of data when some event occurs.
DMA requires an extensive amount of special hardware to manage
the data transfers and to arbitrate access to the system bus.
This might seem to violate our desire to use software wherever
possible. However, DMA makes sense when the transfer rates exceed
anything possible with software. Even the fastest loop in assembly
language comes burdened with lots of baggage. A short code fragment
that reads from a port, stores to memory, increments pointers,
decrements a loop counter, and then repeats based on the value
of the counter takes quite a few clock cycles per byte copied.
A hardware DMA controller can do the same with no wasted cycles
and no CPU intervention.
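The software loop described above can be sketched in C as follows. This is
only an illustration: the function name pio_read_block and the idea of
reaching the port through a pointer are assumptions, not something taken
from a particular system. Every pass through the loop costs a port read,
a memory write, a pointer increment, a counter decrement, and a branch.

```c
#include <stddef.h>
#include <stdint.h>

/* Programmed I/O: the CPU itself moves every byte from the port to RAM.
   "port" is assumed to be a memory-mapped data register; volatile forces
   a real read on each pass instead of a cached value. */
void pio_read_block(volatile const uint8_t *port, uint8_t *dst, size_t count)
{
    for (; count != 0; --count) {   /* decrement counter, test, branch */
        *dst++ = *port;             /* read port, store, bump pointer  */
    }
}
```

A DMA controller does the same job in hardware, so none of the loop
overhead lands on the CPU.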
Admittedly, modern processors often have blindingly fast looping
instructions. The 386's REP (repeat string) prefix moves data much faster
than most applications will ever need. However, the latency between
a hardware event occurring and the code being ready to execute
the REP instruction will surely be many microseconds even in the most
carefully crafted program - far too much time in many applications.
How it Works
Processors provide one or two levels of DMA support. Since the
dawn of the micro age just about every CPU has had the basic bus
exchange support. This is quite simple, and usually consists of
just a pair of pins.
"Bus Request" (AKA "Hold" on Intel CPUs) is
an input that, when asserted by some external device, causes the
CPU to tri-state its pins at the completion of the next instruction.
"Bus Grant" (AKA "Bus Acknowledge" or "Hold
Acknowledge") signals that the processor is indeed tristated.
This means any other device can put addresses, data, and control
signals on the bus. The idea is that a DMA controller can cause
the CPU to yield control, at which point the controller takes
over the bus and initiates bus cycles. Obviously, the DMA controller
must be pretty intelligent to properly handle the timing and to
drive external devices through the bus.
Modern high integration processors often include DMA controllers
built right on the processor's silicon. This is part of the vendors'
never-ending quest to move more and more of the support silicon
to the processor itself, greatly reducing the cost and complexity
of building an embedded system. In this case the Bus Request and
Bus Grant pins are connected to the onboard controller inside
of the CPU package, though they usually come out to pins as well,
so really complex systems can run multiple DMA controllers. It's
a scary thought....
Every DMA transfer starts with the software programming the DMA
controller, the device (either on board a high integration CPU chip
or a discrete component) that manages these transactions. The
code must typically set up destination and source pointers to
tell the controller where the data is coming from, and where it
is going to. A counter must be programmed to track the number
of bytes in the transfer. Finally, numerous bits setting the DMA
mode and type must be set. These may include source or destination
type (I/O or memory), action to take on completion of the transaction
(generate an interrupt, restart the controller, etc.), wait states
for each CPU cycle, etc.
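As a rough sketch of that setup sequence, assuming a made-up controller:
the register layout, the bit names, and dma_setup itself are all
hypothetical, since every real part arranges these differently. The point
is only the order of operations: pointers and count first, mode bits and
the enable bit last.

```c
#include <stdint.h>

/* Hypothetical register block. Real controllers differ in names,
   widths, and bit positions; check the data sheet for your part. */
typedef struct {
    uint32_t src;      /* where the data comes from  */
    uint32_t dst;      /* where the data goes        */
    uint16_t count;    /* bytes in the transfer      */
    uint16_t control;  /* mode and type bits         */
} dma_regs_t;

/* Hypothetical mode bits. */
enum {
    DMA_SRC_IS_IO   = 1u << 0,  /* source is an I/O port, not memory   */
    DMA_DST_IS_IO   = 1u << 1,  /* destination is an I/O port          */
    DMA_SRC_INC     = 1u << 2,  /* bump source pointer each transfer   */
    DMA_DST_INC     = 1u << 3,  /* bump destination pointer            */
    DMA_IRQ_ON_DONE = 1u << 4,  /* interrupt when the count hits zero  */
    DMA_ENABLE      = 1u << 7   /* arm the channel                     */
};

void dma_setup(volatile dma_regs_t *dma, uint32_t src, uint32_t dst,
               uint16_t count, uint16_t mode)
{
    dma->control = 0;           /* keep the channel off while reprogramming */
    dma->src     = src;
    dma->dst     = dst;
    dma->count   = count;
    dma->control = (uint16_t)(mode | DMA_ENABLE);  /* arm it last */
}
```

On real hardware the pointer passed in would be the controller's base
address cast to the register type; here any dma_regs_t variable works,
which makes the routine easy to check on a desktop compiler.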
Now the DMA controller waits for some action to start the transfer.
Perhaps an external I/O device signals it is ready by toggling
a bit. Sometimes the software simply sets a "start now"
flag. Regardless, the controller takes over the bus and starts
the transfer.
Each DMA transfer looks just like a pair of normal CPU cycles.
A memory or I/O read from the DMA source address is followed by
a corresponding write to the destination. The source and destination
devices cannot tell if the CPU is doing the accesses or if the
DMA controller is doing them.
During each DMA cycle the CPU is dead - it's waiting patiently
for access to the bus, but cannot perform any useful computation
during the transfer. The DMA controller is the bus master. It
has control until it chooses to release the bus back to the CPU.
Depending on the type of DMA transfer and the characteristics
of the controller, a single pair of cycles may terminate to allow
the CPU to run for a while, or a complete block of data may be
moved without bringing the processor back from the land of the
dead.
Once the entire transfer is complete the DMA controller may quietly
go to sleep, or it may restart the transfer when the I/O is ready
for another block, or it may signal the processor that the action
is complete. In my experience this is always signaled via an interrupt.
The controller interrupts the CPU so the firmware can take the
appropriate action.
To summarize, the processor programs the DMA controller with parameters
about the transfers, the controller tri-states the CPU and moves
data over the CPU bus, and then when the entire block is done
the controller signals completion via an interrupt.
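The whole sequence can be modeled in a few lines of C. This is a software
stand-in, not a driver: sim_dma_t, sim_dma_step, sim_dma_run_block, and
sim_dma_count_irq are invented names, and the "interrupt" is just a
callback. Each step is a read cycle followed by a write cycle, and the
callback fires once when the count reaches zero.

```c
#include <stddef.h>
#include <stdint.h>

typedef void (*dma_done_fn)(void *ctx);

/* Software model of one DMA channel (hypothetical, for illustration). */
typedef struct {
    const uint8_t *src;     /* source pointer                    */
    uint8_t       *dst;     /* destination pointer               */
    size_t         count;   /* bytes left to move                */
    dma_done_fn    on_done; /* stands in for the completion IRQ  */
    void          *ctx;     /* passed to on_done                 */
} sim_dma_t;

/* Move one byte as a read cycle plus a write cycle.
   Returns 1 while bytes remain, 0 once the block is complete. */
int sim_dma_step(sim_dma_t *d)
{
    uint8_t latch;

    if (d->count == 0)
        return 0;
    latch = *d->src++;          /* read cycle from the source       */
    *d->dst++ = latch;          /* write cycle to the destination   */
    if (--d->count == 0) {
        if (d->on_done)
            d->on_done(d->ctx); /* signal completion exactly once   */
        return 0;
    }
    return 1;
}

/* Run the whole block without handing the bus back. */
void sim_dma_run_block(sim_dma_t *d)
{
    while (sim_dma_step(d)) {
    }
}

/* Example completion handler: counts how many times it was called. */
void sim_dma_count_irq(void *ctx)
{
    ++*(int *)ctx;
}
```

Calling sim_dma_step once per ready signal models a transfer that returns
the bus between bytes; sim_dma_run_block models moving the entire block
in one go.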
DMA controllers are wondrous and scary things, each with dozens
of registers you must program just right to get any sort of response.
I think these things are designed by committee, with each member
throwing every possible feature into the chip. Though it's nice
to have so much capability, writing the code can be a trial.
You must first have a very clear idea of exactly what sort of
DMA transfers your system needs. DMA was invented to move data
between I/O and memory, but now people use it for a variety of
other reasons as well.
Traditional Synchronous DMA moves a byte or word at a time between
system memory and a peripheral, handshaking with the I/O port
for each transfer. This sort of transfer recognizes that the port
may not always be in a ready condition; the handshaking is a hardware
mechanism to throttle the transactions.
With this sort of transfer, the program sets up the controller
and then carries on, oblivious to the state of the DMA transaction.
The hardware moves one byte or word between memory and I/O each
time the I/O port signals it is ready for another transaction.
On each ready indication, the DMA controller asserts Bus Request,
waits for a Bus Acknowledge in response, and then takes over the
bus for a single cycle. Then, the DMA controller goes idle again,
waiting for another ready signal from the port. Thus, the program
and DMA cycles share bus cycles, with the controller winning any
contest for control of the bus. Sometimes this is called "Cycle
Stealing".
Burst Mode DMA, in contrast, generally assumes that the destination
and source addresses can take transfers as fast as the controller
can generate them. The program sets up the controller, and then
(perhaps after a single ready indication from a port occurs),
the entire source block is copied to the destination. The DMA
controller gains exclusive access to the bus for the duration
of the transfer, during which time the program is effectively
shut down. Burst mode DMA can transfer data very rapidly indeed.
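To see how differently the two modes treat the CPU, here is a toy bus
model. It rests on simplifying assumptions that are mine, not any
controller's: every transfer costs exactly two bus cycles (read plus
write), arbitration is free, and in the synchronous case the port signals
ready every ready_period cycles. The names bus_stats_t and simulate_bus
are made up for the sketch.

```c
/* Result of a simulated run. */
typedef struct {
    unsigned long cpu_cycles;    /* bus cycles the program got          */
    unsigned long dma_cycles;    /* bus cycles the controller took      */
    unsigned long longest_stall; /* longest run with the CPU locked out */
} bus_stats_t;

/* Toy model: burst != 0 moves the whole block starting at cycle 0;
   burst == 0 moves one byte each time the port signals ready
   (every ready_period cycles; 0 means the port never signals). */
bus_stats_t simulate_bus(unsigned long total_cycles, unsigned long bytes,
                         unsigned long ready_period, int burst)
{
    bus_stats_t s = {0, 0, 0};
    unsigned long pending = 0;   /* DMA bus cycles still owed */
    unsigned long stall = 0;     /* current lockout length    */
    unsigned long t;

    for (t = 0; t < total_cycles; ++t) {
        if (bytes > 0) {
            if (burst) {
                if (t == 0) {            /* one trigger, whole block */
                    pending += 2 * bytes;
                    bytes = 0;
                }
            } else if (ready_period > 0 && t % ready_period == 0) {
                pending += 2;            /* one byte per ready       */
                --bytes;
            }
        }
        if (pending > 0) {               /* controller wins the bus  */
            --pending;
            ++s.dma_cycles;
            if (++stall > s.longest_stall)
                s.longest_stall = stall;
        } else {                         /* program gets the cycle   */
            ++s.cpu_cycles;
            stall = 0;
        }
    }
    return s;
}
```

For an 8 byte block in a 100 cycle window with a ready every 10 cycles,
both modes cost the program the same 16 cycles in this model. The
difference is the lockout: two cycles at a time in the synchronous case,
16 in a row for the burst.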
Flyby DMA, something that is not supported on many controllers,
is a beast of a different color. The DMA controller gains access
to the bus and puts the source or destination address out. Then,
it initiates what is in effect a read and a write cycle simultaneously.
The data is read from the source address, and written to the destination,
at the same time. This implies that either the source or destination
does not require an address, since it is very unlikely that both
would use the same. An example might be copying data from memory
to a FIFO port - the source address (a pointer to memory) increments
on each transfer, while the destination is always the same FIFO.
Flyby transactions are very fast since the read/write cycle pair
is reduced to a single cycle. Both burst and synchronous types
of transfers can be supported.
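The arithmetic behind that speedup is simple enough to write down. The
sketch below assumes one bus cycle per flyby transfer and two for a
conventional read/write pair, and it ignores arbitration overhead and
wait states; the names dma_kind_t, dma_bus_cycles, and dma_block_time_ns
are illustrative only.

```c
#include <stdint.h>

/* Bus cycles consumed per transfer for each style of DMA. */
typedef enum {
    DMA_FLYBY     = 1,  /* read and write happen in the same cycle */
    DMA_TWO_CYCLE = 2   /* separate read cycle and write cycle     */
} dma_kind_t;

/* Total bus cycles needed to move a block. */
uint64_t dma_bus_cycles(uint32_t transfers, dma_kind_t kind)
{
    return (uint64_t)transfers * (uint64_t)kind;
}

/* Time to move a block, given the length of one bus cycle in ns. */
uint64_t dma_block_time_ns(uint32_t transfers, dma_kind_t kind,
                           uint32_t bus_cycle_ns)
{
    return dma_bus_cycles(transfers, kind) * bus_cycle_ns;
}
```

With a 250 ns bus cycle, 1024 transfers take 512000 ns as read/write
pairs and 256000 ns as flyby cycles under these assumptions.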
The original IBM PC, that 8088 based monstrosity we all once yearned
for but now snicker at, used a DMA controller to generate dynamic
RAM refresh addresses. It simply ran a null transfer every 15
microseconds or so to generate the addresses needed by the DRAMs.
This was a very clever design - a normal refresh controller is
a mess of logic. The only down side was that the PC's RAM was
non-functional until the power-up code properly programmed the
DMA controller.
Both floppy and hard disk controllers often use DMA to transfer
data to and from the drive. This is a natural and perfect application.
The software arms the controller and then carries on. The hardware
waits for the drive to get to the correct position and then performs
the transfer without further reliance on the system software.
If I'm working on a microprocessor with a "free" DMA
controller (one built onto the chip), I'll sometimes use it for
large memory transfers. This is especially useful with processors
with segmented address spaces, like the 80188 or Z180 (the Z180's
space is not actually segmented, but is limited to 64k without
software intervention to program the MMU).
Both of these CPUs include on-board DMA controllers that support
transfers over the entire 1 MB address space of the part. Judicious
programming of the controller lets you do a simple and easy memory
copy of any size to any address - all without worrying about segment
registers or the MMU. This is yet another argument for encapsulation:
write a DMA routine once, debug it thoroughly, and then reuse
it even for mundane tasks.
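A sketch of what such an encapsulated routine might look like for an
8086-family part follows. The segment-times-16-plus-offset conversion is
how those processors form a 20-bit physical address; everything else is
an assumption for illustration. In particular dma_copy here is a
stand-in that copies inside a flat image of the address space, where a
real version would load the two 20-bit pointers and the count into the
controller and wait for completion.

```c
#include <stdint.h>

#define PHYS_SPACE (1ul << 20)   /* 1 MB physical address space */

/* 8086-family physical address: segment * 16 + offset, 20 bits wide. */
uint32_t phys_addr(uint16_t seg, uint16_t off)
{
    return (((uint32_t)seg << 4) + off) & 0xFFFFFul;
}

/* Hypothetical DMA copy between two 20-bit physical addresses.
   "mem" is a flat image of the address space standing in for the
   hardware. Returns 0 on success, -1 if the block runs off the end.
   Copies low to high, so overlapping blocks are the caller's problem. */
int dma_copy(uint8_t *mem, uint32_t dst, uint32_t src, uint32_t count)
{
    uint32_t i;

    if (dst >= PHYS_SPACE || src >= PHYS_SPACE)
        return -1;
    if (count > PHYS_SPACE - dst || count > PHYS_SPACE - src)
        return -1;
    for (i = 0; i < count; ++i)
        mem[dst + i] = mem[src + i];
    return 0;
}
```

The caller never touches a segment register: a block that straddles a
64k boundary is just a source address, a destination address, and a
count.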
Over the years I've profiled a lot of embedded code, and in many
instances have found that execution time seems to be really burned
up by string copy and move routines inside the C runtime library.
If you have a spare DMA controller channel, why not recode portions
of the library to use a DMA channel to do the moves? Depending
on the processor, a tremendous speed improvement can result.
I'll never forget the one time I should have used DMA, but didn't.
As a consultant, rushed to get a job done, I carelessly threw
together a hardware design figuring somehow I could make things
work by tuning the software. For some inexplicable reason I did
not put a DMA controller on the system, and suffered for weeks,
tuning a few instructions to move data off a tape drive without
missing bytes. A little more forethought would have made a big
difference.
This is a case where a thorough knowledge of the hardware is essential
to making the software work. DMA is almost impossible to troubleshoot
without using a logic analyzer.
No matter what mode the transfers will ultimately use, and no
matter what the source and destination devices are, I always first
write a routine to do a memory to memory DMA transfer. This is
much easier to troubleshoot than DMA to a complex I/O port. You
can use your ICE to see if the transfer happened (by looking at
the destination block), and to see if exactly the right number
of bytes were transferred.
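The same check can be automated with guard bytes around the destination
block. The harness below is a sketch: check_copy, copy_fn, and the two
sample routines are hypothetical names, and the routine under test is
passed in as a function pointer so a real memory to memory DMA wrapper
can be dropped in later. The source pattern never uses the fill value,
so a transfer that stops short shows up as a data mismatch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef void (*copy_fn)(uint8_t *dst, const uint8_t *src, size_t n);

#define GUARD 8      /* guard bytes on each side of the destination */
#define MAX_N 256    /* largest block the harness will test         */
#define FILL  0xA5   /* guard and background value                  */

/* Returns 0 if exactly n correct bytes landed in the destination,
   1 on a data mismatch (including a short transfer),
   2 if bytes before the block were touched,
   3 if bytes past the block were touched (too many transferred),
  -1 if n is larger than the harness supports. */
int check_copy(copy_fn copy, size_t n)
{
    uint8_t src[MAX_N];
    uint8_t buf[GUARD + MAX_N + GUARD];
    size_t i;

    if (n > MAX_N)
        return -1;
    for (i = 0; i < n; ++i)
        src[i] = (uint8_t)(i & 0x7F);   /* 0x00..0x7F, never FILL */
    memset(buf, FILL, sizeof buf);

    copy(buf + GUARD, src, n);          /* routine under test */

    for (i = 0; i < GUARD; ++i)
        if (buf[i] != FILL)
            return 2;
    for (i = 0; i < n; ++i)
        if (buf[GUARD + i] != src[i])
            return 1;
    for (i = GUARD + n; i < sizeof buf; ++i)
        if (buf[i] != FILL)
            return 3;
    return 0;
}

/* Sample routines: a correct copy and one that writes a byte too many. */
void good_copy(uint8_t *dst, const uint8_t *src, size_t n)
{
    memcpy(dst, src, n);
}

void overrun_copy(uint8_t *dst, const uint8_t *src, size_t n)
{
    memcpy(dst, src, n);
    dst[n] = 0;                         /* deliberate off-by-one */
}
```

An off-by-one in the byte count is exactly the kind of slip that appears
to work while quietly trashing the byte after the buffer.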
At some point you'll have to recode to direct the transfer to
your device. Hook up a logic analyzer to the DMA signals on the
chip to be sure that the addresses and byte count are correct.
Check this even if things seem to work - a slight mistake might
trash part of your stack or data space.
Some high integration CPUs with internal DMA controllers do not
produce any sort of cycle that you can flag as being associated
with DMA. This drives me nuts - one lousy extra pin would greatly
ease debugging. The only way to track these transfers is to trigger
the logic analyzer on address ranges associated with the transfer,
but unfortunately these ranges may also have non-DMA activity.
Be aware that DMA will destroy your timing calculations. Bit banging
UARTs will not be reliable; carefully crafted timing loops will
run slower than expected. In the old days we all counted T-states
to figure how long a loop ran, but DMA, prefetchers, cache, and
all sorts of modern exoticness make it almost impossible to calculate
real execution time.
On another subject, some time ago I wrote about using oscilloscopes
to debug software. The subject is really too big to do justice
in a couple of short magazine pieces. However, I just received
a booklet from Tektronix that does justice to the subject. It's
called "Basic Concepts - XYZ of Analog and Digital Oscilloscopes",
and is their publication number 070869001. Highly recommended.
Get the book, borrow a scope, and play around for a while. It's
fun and tremendously worthwhile.