
Multicore Challenges

Summary: Moore's Law means more transistors per chip. But will those power-sucking semiconductors doom multicore?

In 1974 Robert Dennard came up with a scaling theory that drew on Moore's Law to promise ever-faster microprocessors. If from one generation to the next the transistor length shrinks by a factor of about 0.7 then the transistor budget doubles, speed goes up by 40%, total chip power remains the same, and a legion of other good things continues to be bestowed on the semiconductor industry.
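The arithmetic behind those promises can be checked with a rough sketch. It assumes the textbook idealization in which capacitance and supply voltage both scale with the linear shrink factor and dynamic power per transistor goes as C * V^2 * f. The function name and dictionary keys are illustrative, not taken from Dennard's paper.

```python
def dennard_step(k=0.7):
    """Idealized Dennard scaling for one process generation.

    k is the linear shrink factor (about 0.7 per generation).
    Every returned value is relative to the previous generation.
    """
    density = 1 / k**2       # transistors per unit area
    frequency = 1 / k        # gate delay shrinks with the dimensions
    capacitance = k          # per-transistor capacitance (assumed ~ k)
    voltage = k              # supply voltage (assumed ~ k)
    # Dynamic power per transistor ~ C * V^2 * f
    power_per_transistor = capacitance * voltage**2 * frequency
    # Same die area, so chip power = per-transistor power * density
    chip_power = power_per_transistor * density
    return {
        "density": density,
        "frequency": frequency,
        "power_per_transistor": power_per_transistor,
        "chip_power": chip_power,
    }


if __name__ == "__main__":
    for name, value in dennard_step().items():
        print(f"{name}: {value:.2f}x")
```

With k = 0.7 this gives roughly twice the transistors, a clock about 1.4 times faster, and unchanged chip power, which matches the figures quoted above.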

Unfortunately Dennard scaling petered out at 90 nm. Clock rates stagnated and power budgets have grown at each process node. Many traditional tricks just don't work any more. For instance, shrinking transistors meant thinner gate oxides, but once those hit 1.2 nm (about the size of five adjacent silicon atoms) tunneling created unacceptable levels of leakage. Semiconductor engineers replaced the silicon-dioxide insulator (with a dielectric constant of 3.9) with other materials like hafnium dioxide (dielectric constant = 25), to allow for somewhat thicker insulation. Voltages had to go down, but are limited by subthreshold leakage as the transistors' threshold voltage must inevitably decline. More leakage means greater power dissipation. A lot of innovative work is being done, like the use of 3D finFETs, but the Moore's manna of yore has, to a large extent, dried up.
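To see why a higher dielectric constant buys thicker insulation, here is a minimal sketch based on the ideal parallel-plate relation (capacitance per unit area is proportional to k divided by thickness). The function name is made up for illustration, and the model ignores interfacial layers and other real-world effects, so actual gate stacks will differ from this ideal figure.

```python
K_SIO2 = 3.9    # dielectric constant of silicon dioxide (from the text)
K_HFO2 = 25.0   # dielectric constant of hafnium dioxide (from the text)


def equivalent_thickness_nm(sio2_nm, k_new=K_HFO2, k_ref=K_SIO2):
    """Thickness of a k_new film with the same capacitance per unit area
    as sio2_nm of SiO2, under the ideal parallel-plate model."""
    return sio2_nm * k_new / k_ref


if __name__ == "__main__":
    t = equivalent_thickness_nm(1.2)
    print(f"1.2 nm of SiO2 ~ {t:.1f} nm of HfO2 for equal capacitance")
```

In this idealized model, a hafnium dioxide layer several times thicker than 1.2 nm gives the same gate capacitance, and the added thickness is what suppresses tunneling.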

Like the cavalry in a bad western, multicore came riding to the rescue, and it's hard to go a day without seeing some new many-core CPU introduction. Most sport symmetric multiprocessing architectures, where two or more cores share some cache plus the main memory. Some problems can really profit from SMP, but many can't. Amdahl's Law tells us that even with an infinite number of cores an application that is 50% parallelizable will get only a 2x speedup over a single-core design. But that law is optimistic, and doesn't account for the inevitable bus conflicts that will occur when sharing L2 and main memory. Interprocessor communication, locks and the like make things even worse.
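Amdahl's Law is easy to play with numerically. The sketch below uses the standard form, speedup = 1 / (serial + parallel / cores), and like the law itself it ignores bus contention, locks and interprocessor communication. The function names are illustrative.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Speedup over one core when parallel_fraction of the work
    can be spread evenly across `cores` cores."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)


def amdahl_limit(parallel_fraction):
    """Upper bound on speedup as the core count goes to infinity.
    Not defined for parallel_fraction == 1.0 (no serial work at all)."""
    return 1.0 / (1.0 - parallel_fraction)


if __name__ == "__main__":
    for cores in (2, 4, 16, 1024):
        print(f"50% parallel, {cores} cores: {amdahl_speedup(0.5, cores):.2f}x")
    print(f"50% parallel, infinite cores: {amdahl_limit(0.5):.2f}x")
```

For the 50% case the speedup creeps toward 2x but never reaches it, no matter how many cores are added.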

Data (https://share.sandia.gov/news/resources/news_releases/more-chip-cores-can-mean-slower-supercomputing-sandia-simulation-shows/) from Sandia National Labs shows that even for some very parallel problems multicore just doesn't scale after a small number of processors are involved.

In "Power Challenges May End the Multicore Era" (Communications of the ACM, February 2013, subscription required) the authors develop rather complex models that show multicore may (and the operative word is "may") bang into a dead-end due to power constraints. Soon.

The key takeaway is that by the 8 nm node (expected around 2018) more than 50% of the transistors on a microprocessor die will have to be dark, or turned off, at any one time just to keep the parts from self-destructing from overheating. The most optimistic scenarios show only a 7.9x speedup between the 45 nm and 8 nm nodes; a more conservative estimate pegs that at 3.7x. The latter is some 28 times less than the gains Moore's Law has led us to expect.

I have some problems with the paper:

The authors assume an Intel/AMD-like CPU architecture. That is, huge, honking processors whose entire zeitgeist is performance. We in the embedded space are already power-constrained and generally use simpler CPUs. It's reasonable to assume a mid-level ARM part will run into the same issues, but perhaps not at 8 nm.

They don't discuss memory contention, locks and interprocessor communication. That's probably logical as their thesis is predicated on power constraints. But these issues will make the results even worse in real-world applications. The equations presented indicate no bus contention for shared L2 (and L2 is always shared on multicore CPUs) and none for main memory accesses. Given that L1 is tiny (32-64 KB) one would expect plenty of L1 misses and thus lots of L2 activity... and therefore plenty of contention.

The models analyze applications in which 75% to 99% of the work can be done in parallel. Plenty of embedded systems won't come near 75%.

It appears the analysis assumes cache wait states are constant: 3 for L1, and 20 for L2. Historically that has not been the case - the 486 had zero wait state cache. It's hard to predict how caches in the future will behave, but assuming past trends continue the paper's conclusions will be even worse.

The paper figures on a linear relationship between frequency and performance, and the authors acknowledge that memory speeds don't support this assumption.

The last point is insanely hard to analyze. Miss rates for L1 and L2 are extremely dependent on the application. SDRAM is very slow for the first access to a block, though succeeding transfers happen very quickly indeed. But any transaction could take anywhere from just three cycles (if in L1) to hundreds. One wonders how much tolerance a typical hard real-time system would have for such uncertainty.
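A back-of-the-envelope average memory access time calculation shows how strongly the result depends on miss rates. The 3-cycle L1 and 20-cycle L2 figures are the ones the paper appears to use; the 200-cycle SDRAM latency and the miss rates below are purely hypothetical placeholders, and treating the L2 figure as a penalty added on top of the L1 time is also an assumption.

```python
def average_access_cycles(l1_miss_rate, l2_miss_rate,
                          l1_cycles=3, l2_cycles=20, dram_cycles=200):
    """Average cycles per memory access for a two-level cache.

    l2_miss_rate is the fraction of L1 misses that also miss in L2.
    dram_cycles=200 is a made-up stand-in for "hundreds" of cycles.
    """
    return l1_cycles + l1_miss_rate * (l2_cycles + l2_miss_rate * dram_cycles)


if __name__ == "__main__":
    # Hypothetical miss rates; real ones depend heavily on the application.
    for l1_miss, l2_miss in ((0.01, 0.1), (0.05, 0.2), (0.20, 0.5)):
        avg = average_access_cycles(l1_miss, l2_miss)
        print(f"L1 miss {l1_miss:.0%}, L2 miss {l2_miss:.0%}: {avg:.1f} cycles")
    # Worst case for a single access: misses in both caches.
    print("worst single access:", average_access_cycles(1.0, 1.0), "cycles")
```

The average may look tame, but a hard real-time system has to budget for the worst case, which under these assumptions is more than seventy times the L1 hit time.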

Two conclusions are presented: the pessimistic one is the Chicken Little scenario where we hit a computational brick wall. Happily, the paper addresses a number of more optimistic possibilities ranging from microarchitecture improvements to unpredictable disruptive technologies. The latter have driven semiconductor technology for decades, and I for one am optimistic that some cool and unexpected inventions will continue to drive computer performance on its historical upward trajectory.

Published March 7, 2013