Fixed Point Math
Summary: Ints have limited ranges. Floats can be slow and memory hogs. Are there any alternatives?
Recently Colin Walls had an article on this site (http://www.embedded.com/design/debugandoptimization/4440365/Floatingpointdatainembeddedsoftware) about floating point math. Once it was common for embedded engineers to scoff at floats; many have told me they have no place in this space. That's simply wrong. The very first embedded program I wrote, in 1972 or so, used an 8008 to control a near-infrared instrument. Even though we were limited to 4KB of EPROM (it was incredibly expensive then), all of the math had to be in floating point using a library written by one Cal Ohne. Amazingly, he crammed that into just 1KB of program memory. And that was on an 8 bitter with a ferociously bad stack and a very limited instruction set.
Today even MCUs sometimes have onboard floating point hardware. ST's STM32F4 parts, for instance, have this feature and some are under four bucks in 1000 piece lots.
But most developers working in the microcontroller space don't have hardware floating point, so they have to use a software library, which is slow and consumes a lot of memory.
We use integers because they are fast and convenient. But their dynamic range is limited and fractional arithmetic is impossible. Floats give us enormous ranges but cost us in performance. An alternative, well known to DSP developers, is fixed point math.
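Integers, of course, look like this (ignoring a sign bit), where each position in a 16 bit word is marked with the weight of that bit:
2^15 2^14 2^13 ... 2^2 2^1 2^0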
Can you see the binary point? It's not shown, but there is one all the way to the right of the 2^0 bit.
Suppose we move that binary point four bits to the left. Now, in the same 16 bit word, the format looks like this (the "." marks the binary point):
2^11 2^10 ... 2^1 2^0 . 2^-1 2^-2 2^-3 2^-4
The number stored in this format is the 16 bit integer value times 2^-4. So if the word looks like 11528 (decimal) it's really 720.5, because 720.5 = 11528 x 2^-4.
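That scaling can be sketched in a few lines of C. This is only an illustration, not code from any particular library: the names FRAC_BITS, fixed_from_double, and fixed_to_double are invented here, and the format is assumed to be an unsigned 16 bit word with four fractional bits.

```c
#include <assert.h>
#include <stdint.h>

#define FRAC_BITS 4u /* assumed: binary point sits four bits from the right */

/* Convert a non-negative real value to the 16 bit fixed point word,
   rounding to the nearest 1/16. */
uint16_t fixed_from_double(double value)
{
    double scaled = value * (double)(1u << FRAC_BITS) + 0.5;
    assert(value >= 0.0 && scaled < 65536.0); /* result must fit in 16 bits */
    return (uint16_t)scaled;                  /* the cast truncates, so +0.5 rounds */
}

/* Recover the real value: the stored integer divided by 16. */
double fixed_to_double(uint16_t word)
{
    return (double)word / (double)(1u << FRAC_BITS);
}
```

With these helpers, fixed_from_double(720.5) yields 11528 and fixed_to_double(11528) gives back 720.5. On a part without floating point hardware one would normally do such conversions only at build time or at the edges of the program, since the double arithmetic here is exactly what fixed point is meant to avoid.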
Obviously, we lose some range; the biggest number expressible is smaller than devoting those 16 bits to ints. But we gain precision.
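To put numbers on that trade, here is a small sketch that computes the largest value and the smallest step for an unsigned 16 bit word with a given number of fractional bits. The function names fixed16_max and fixed16_step are hypothetical, chosen for this example, and the sign bit is ignored as above.

```c
#include <assert.h>
#include <stdint.h>

/* Largest value an unsigned 16 bit word can hold when frac_bits of it
   sit to the right of the binary point. */
double fixed16_max(unsigned frac_bits)
{
    assert(frac_bits <= 16u);
    return 65535.0 / (double)((uint32_t)1 << frac_bits);
}

/* Smallest step between adjacent values: 2^-frac_bits. */
double fixed16_step(unsigned frac_bits)
{
    assert(frac_bits <= 16u);
    return 1.0 / (double)((uint32_t)1 << frac_bits);
}
```

With four fractional bits the largest value drops from 65535 to 4095.9375, while the smallest step shrinks from 1 to 0.0625.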
Here's where the magic comes in: to add two fixed point numbers one just does a normal integer addition. Multiplication is little more than an integer multiply plus a shift. Yes, there are some special cases one must watch for, but the math is very fast compared to floats. So if you need fractional values and speed, fixed point might be your new best friend. (These algorithms are well documented and not worth repeating here, but two references are Embedded Systems Building Blocks by Jean Labrosse and Fixed Point Math in C by Joe Lemieux (http://www.eetimes.com/author.asp?section_id=36&doc_id=1287491)).
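For a feel of how little code is involved, here is a minimal sketch of addition and multiplication for the 16 bit, four fractional bit format used above. It is not taken from either reference; the names fixed_add and fixed_mul are invented for illustration, and clamping on overflow and rounding to nearest are choices made for this example, one way of handling the special cases.

```c
#include <stdint.h>

#define FRAC_BITS 4u /* assumed: four bits to the right of the binary point */

/* Addition: both operands share the same scale, so an ordinary integer
   add does the job. The sum is clamped if it would not fit in 16 bits. */
uint16_t fixed_add(uint16_t a, uint16_t b)
{
    uint32_t sum = (uint32_t)a + (uint32_t)b;
    return (sum > UINT16_MAX) ? UINT16_MAX : (uint16_t)sum;
}

/* Multiplication: the product of two values scaled by 2^4 is scaled by
   2^8, so shift right by four bits to restore the format. A 32 bit
   intermediate holds the full product. */
uint16_t fixed_mul(uint16_t a, uint16_t b)
{
    uint32_t product = (uint32_t)a * (uint32_t)b;
    product += 1u << (FRAC_BITS - 1u); /* add half an LSB to round to nearest */
    product >>= FRAC_BITS;
    return (product > UINT16_MAX) ? UINT16_MAX : (uint16_t)product;
}
```

As a check, 1.5 is stored as 24 and 2.25 as 36. fixed_add(24, 36) returns 60, which is 3.75, and fixed_mul(24, 36) returns 54, which is 3.375.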
There's no rule about where the binary point should be; you select a precision that works for your application. "Q notation" is a standardized way of talking about where the binary point is located. Qf means there are f fractional bits (e.g., Q5 means 5 bits to the right of the binary point). Qn.f tells us there are n bits to the left of the binary point and f to the right.
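Q notation maps directly onto shift counts. The sketch below assumes signed 32 bit raw values and fewer than 31 fractional bits; q_to_double and q_convert are hypothetical helper names invented for this example, not part of any standard library.

```c
#include <assert.h>
#include <stdint.h>

/* Interpret a raw integer as a Qf value: raw times 2^-f. */
double q_to_double(int32_t raw, unsigned f)
{
    assert(f < 31u);
    return (double)raw / (double)((int32_t)1 << f);
}

/* Re-scale a raw value from from_f fractional bits to to_f fractional bits.
   Gaining bits multiplies (the caller must ensure the result fits in 32 bits);
   dropping bits divides, which truncates toward zero. */
int32_t q_convert(int32_t raw, unsigned from_f, unsigned to_f)
{
    assert(from_f < 31u && to_f < 31u);
    if (to_f >= from_f) {
        return raw * ((int32_t)1 << (to_f - from_f));
    }
    return raw / ((int32_t)1 << (from_f - to_f));
}
```

In Q5, for instance, the value 1.0 is stored as 32. Moving 11528 from Q4 to Q5 gives 23056, which still represents 720.5, while converting it to Q0 gives the plain integer 720 and discards the fraction.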
Integers are wonderful: they are easy to understand and very fast. Floats add huge ranges and fractional values, but may be costly in CPU cycles. Sometimes fixed point is just the ticket when one needs a compromise between the other two formats. It's not unusual to see fixed point numbers used in computationally demanding control applications, like PID loops. Add them to your tool kit!
Published September 17, 2015