By Jack Ganssle

Multicore Continued

Published 12/15/2008

Last week (http://embedded.com/columns/breakpoint/212300032) I slammed the multicore hype that pervades this industry. If one were to believe the PR, all of our performance problems go away when we just add a few cores.

The most widely-advertised approach to multicore is called Symmetric Multiprocessing (SMP). Two or more identical CPUs coexist on a single chip. Each has a tiny bit of L1 cache, but generally at least two cores share an L2, and all of the cores have a common memory and bus. Obviously, fast CPUs insulated from each other by tiny caches will suffer bus contention and degraded performance.

Our entire computing paradigm is wrong. It stems from the mainframe era when computers cost money. A lot of money. So much that computing centers evolved, entire buildings with armies of acolytes devoted to a single machine. Computers cost so much that many companies used time-sharing, where precious CPU cycles were doled out in miserly portions to multiple users, each of whom paid handsomely for their small piece of the pie.

Minicomputers were expensive, too, so were used sparingly. Even microprocessors cost a lot; the 8080 CPU was $400 a pop, just for the processor chip itself. That's in 1975 dollars and is equivalent to around $1600 today.

We talk about the effects of Moore's Law without examining the implications. If transistors get ever cheaper, then so do processors. At some point a CPU is, to a first approximation, free. Today many CPUs cost tens of cents. On an SoC a 32 bitter might need just a couple of square millimeters.

Nearly-free processors change the entire computing paradigm that has been predicated on big, expensive iron. Instead of bringing the problem to the computer, bring computers to the problem.

The situation is similar to manufacturing. Labor is cheap (or was; today factories use cheap robots). Raw materials dumped into one side of the building are slowly transformed into useful products via a number of stations. At each stop a small change is made; a worker inserts a widget or tightens a bolt before the assembly line brings the product to another station for another incremental improvement.

Consider how we'd calculate, say, color, using the usual single-processor model. One CPU would read three A/D converters (R, G and B), smooth each of the data points, apply calibration corrections to each, transform each to some sort of screen format, and update displays. The amount of work involved just in moving instructions and data around is staggering - and none of that work is useful. It's all just infrastructure that supports a single CPU doing many independent things.

The assembly-line model changes everything. Build three assembly lines, one per color. Each line has multiple stations, each staffed by an independent "laborer" - a processor. One handles the A/D and passes data to another that computes an IIR filter. That hands the smoothed data to a station that applies calibration constants, which in turn passes it along for further processing. This streaming model reduces the load/store overhead, and each processor, doing just a small share of the work, contributes to vastly better system performance.
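To make the assembly line concrete, here is a minimal sketch of one color's line, with each station written as a Python generator that takes samples from the previous station and hands results to the next. It only models the data flow; on real AMP hardware each station would be its own processor with its own small local memory. Every name here (adc_station, iir_station, calibration_station, display_station, build_line) is hypothetical, and the 10-bit A/D range, the first-order filter with alpha = 0.5, and the gain/offset form of the calibration are assumptions chosen for illustration, not details from any particular product.

```python
from typing import Iterable, Iterator


def adc_station(raw_counts: Iterable[int]) -> Iterator[float]:
    """Station 1: stand-in for an A/D read; scales assumed 10-bit counts to 0..1."""
    for count in raw_counts:
        yield count / 1023.0


def iir_station(samples: Iterable[float], alpha: float = 0.5) -> Iterator[float]:
    """Station 2: first-order IIR smoother, y[n] = alpha*x[n] + (1 - alpha)*y[n-1]."""
    y = None
    for x in samples:
        # The first sample seeds the filter state.
        y = x if y is None else alpha * x + (1.0 - alpha) * y
        yield y


def calibration_station(samples: Iterable[float],
                        gain: float, offset: float) -> Iterator[float]:
    """Station 3: applies assumed per-channel calibration constants."""
    for s in samples:
        yield gain * s + offset


def display_station(samples: Iterable[float]) -> Iterator[int]:
    """Station 4: converts to an 8-bit screen value, clamped to 0..255."""
    for s in samples:
        yield max(0, min(255, int(s * 255 + 0.5)))


def build_line(raw_counts: Iterable[int],
               gain: float = 1.0, offset: float = 0.0) -> Iterator[int]:
    """Chain the stations into one assembly line for a single color."""
    smoothed = iir_station(adc_station(raw_counts))
    return display_station(calibration_station(smoothed, gain, offset))


# Three independent lines, one per color; the input counts are made-up data.
raw = {"R": [0, 1023, 1023], "G": [512, 512, 512], "B": [1023, 0, 0]}
screen = {color: list(build_line(counts)) for color, counts in raw.items()}
```

Each station knows only about its own input and output, which is the property that lets the work be spread across small, cheap processors: no station needs the whole program or the whole data set in its memory.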

The processors may be identical or can vary depending on need; perhaps one station needs a DSP while another gets by with a brain-dead 8 bitter. Each has its own local memory which, since the workload is small, can be relatively tiny. This is called Asymmetric Multiprocessing (AMP) and is quite orthogonal to SMP. AMP scales better to most of the problems we face in the embedded world.

Though SMP is a useful way to solve many compute-intensive problems, it's hardly the only solution. A number of vendors are currently offering AMP products, both as physical processors and as IP for use on FPGAs or SoCs. If the problem you are trying to solve can be structured to resemble an assembly line, check into the AMP alternatives.