Altera Kicks Up Floating Point

by Kevin Morris


The Cray-2, the world’s fastest computer until about 1990, was capable of almost 2 GigaFLOPS (Billion Floating Point Operations per Second) – at an inflation-adjusted price of over $30 million. A decade later, ASCI Red – selling for a cool $70 million or so – topped one teraFLOPS (Trillion Floating Point Operations per Second). The machine was more than twice as expensive, but the price per performance had dropped from ~$15M/GFLOPS (Cray-2) to ~$70K/GFLOPS (ASCI Red). That’s a shocking improvement. Moore’s Law would have us believe in a ~32x gain over the course of a decade, but real-world supercomputers delivered over 200x in just ten years. Take that, Dr. Moore!

Sometime in 2015, according to Altera, we will have a single FPGA (yep, that’s right, one chip) – designed by Altera and manufactured by Intel – capable of approximately TEN teraFLOPS. Let’s do some math on that, shall we? We don’t know exactly what a Stratix 10 FPGA will cost, but it almost doesn’t matter. This device should put us in the realm of $1/GFLOPS. Or, compared to ASCI Red, an additional 70,000x improvement in cost per performance. Compared to the Cray-2 of 1990 (a quarter century earlier), that’s a 15,000,000x improvement – in a time span when even a generous reading of Moore’s Law (doubling every two years) accounts for less than a 10,000x gain. This is all very fuzzy math, but it appears that high-performance computing will have outpaced Moore’s Law by some 1,500x since 1990.
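For anyone who wants to check the fuzzy math, here’s a back-of-the-envelope sketch in C using this article’s own round numbers – and note that the $1/GFLOPS figure for Stratix 10 is our guess, not an announced price:

```c
/* Back-of-the-envelope cost-per-GFLOPS arithmetic, using the article's
   own (very fuzzy) round numbers. All figures are approximations. */
#include <stdio.h>

int main(void) {
    double cray2     = 30e6 / 2.0;    /* ~$30M for ~2 GFLOPS         */
    double asci_red  = 70e6 / 1000.0; /* ~$70M for ~1,000 GFLOPS     */
    double stratix10 = 1.0;           /* assumed ~$1/GFLOPS (a guess) */

    printf("Cray-2:   $%.0f per GFLOPS\n", cray2);     /* ~15,000,000 */
    printf("ASCI Red: $%.0f per GFLOPS\n", asci_red);  /* ~70,000     */
    printf("Cray-2 -> ASCI Red:     %.0fx\n", cray2 / asci_red);     /* ~214x       */
    printf("ASCI Red -> Stratix 10: %.0fx\n", asci_red / stratix10); /* ~70,000x    */
    printf("Cray-2 -> Stratix 10:   %.0fx\n", cray2 / stratix10);    /* ~15,000,000x */
    return 0;
}
```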

Whoa!

Now, before you all pull out your slide-rules and start shouting about everything from our underlying assumptions to Altera’s marketing “techniques,” let’s see what’s changed to make that possible. We all know that the underlying technology – semiconductors – has tracked pretty straight with Moore’s Law (if you live in a bizarre logarithmic land where you count a 50-year exponential as “straight”). That means our computing hardware has made some serious gains in places other than the number of transistors packed onto a single die.

What kinds of engineering innovation give us these extra three orders of magnitude of “goodness”? The case we’re examining – the most recent innovation, announced just this week – is Altera’s hardening of their floating point arithmetic units. IEEE 754 Single Precision Floating Point is now fully supported in optimized hardware – in the DSP blocks of both the current Arria 10 and the upcoming Stratix 10 FPGAs. This brings a major performance boost to floating point applications targeting FPGAs.
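For readers who haven’t stared at the format in a while, IEEE 754 single precision (binary32) packs a sign bit, an 8-bit biased exponent, and a 23-bit fraction into one 32-bit word – exactly the representation those DSP blocks now handle natively. Here’s a minimal C sketch of the layout (illustrative only; the example value is arbitrary):

```c
/* Minimal sketch of the IEEE 754 single-precision (binary32) layout:
   1 sign bit | 8 exponent bits (biased by 127) | 23 fraction bits. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <inttypes.h>

int main(void) {
    float f = -118.625f;             /* = -1.110110101b x 2^6 */
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);  /* well-defined way to read the raw bits */

    uint32_t sign     = bits >> 31;           /* 1 bit                        */
    uint32_t exponent = (bits >> 23) & 0xFFu; /* 8 bits, bias 127             */
    uint32_t fraction = bits & 0x7FFFFFu;     /* 23 bits, implicit leading 1  */

    printf("sign=%" PRIu32 " exponent=%" PRIu32 " (unbiased %d) fraction=0x%06" PRIX32 "\n",
           sign, exponent, (int)exponent - 127, fraction);
    return 0;
}
```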

Hey there, Horace, haven’t hardware multipliers been around for at least three decades?

Yes they have. Even back in the 1980s, the venerable 8086 shamelessly rode the coattails of its lesser-known but harder-working sibling, the 8087, to floating point fame and fortune. What Altera has done, however, is to combine the fine-grained, massively-parallel capabilities of a modern FPGA with a very large number of floating-point-capable DSP blocks. While FPGAs have been routing von Neumann processors for years on fixed-point datapath throughput, their supercomputing Achilles heel was always their floating point architecture (or, more precisely, their lack thereof).

Modern FPGAs can contain thousands of DSP units. You can construct a massively parallel datapath/controller architecture using the FPGA fabric that can significantly outperform even the fastest DSP processors on big math-crunching algorithms. Even more significant is the extreme power savings of an FPGA-based implementation compared with a software solution executed by conventional processors. Numerous benchmarks have demonstrated the superiority of FPGAs compared to DSPs, conventional processors, and even GPUs for datapath-oriented computing – both in raw performance and in computational power efficiency.

However, there have always been two major barriers to the adoption of FPGAs for high-performance computing. First is the difficulty of programming. Where a conventional processor or a DSP requires software expertise in a high-level language like C++ (or even FORTRAN, believe it or not, for some high-performance computing projects), FPGAs have always required a background in digital hardware design and fluency in a hardware description language such as VHDL or Verilog. This means that getting your algorithm running on an FPGA has historically required adding a hard-to-find hardware/FPGA guru to your team and a few months to your schedule – two luxuries that many teams do not have.

Altera’s solution to the programming challenge is an elegant one. Since the emergence of GPUs as high-performance computing platforms and the explosion of languages like Nvidia’s CUDA or the Apple-developed (but now open) OpenCL, software engineers have been moving closer to the task of defining explicit parallelism in their code. Altera met those OpenCL programmers more than halfway by providing a design flow that maps OpenCL directly to hardware on Altera FPGAs. If you’re already writing OpenCL implementations of your algorithm to run on GPUs, you can take that same code and target it to FPGAs – with reportedly outstanding results.
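To make that concrete, here’s a minimal, hypothetical example of the kind of kernel in question – a single-precision SAXPY written in standard, vendor-neutral OpenCL C. The kernel name and signature are illustrative; nothing in the source itself is specific to GPUs or FPGAs:

```c
/* A minimal, vendor-neutral OpenCL C kernel: single-precision SAXPY
   (y = a*x + y). This same source could be built for a GPU today or
   handed to an FPGA OpenCL compiler. */
__kernel void saxpy(const float a,
                    __global const float *x,
                    __global float *y)
{
    size_t i = get_global_id(0);   /* one work-item per vector element */
    y[i] = a * x[i] + y[i];
}
```

Each work-item handles one vector element – exactly the kind of explicit, fine-grained parallelism that a compiler can spread across a sea of floating-point DSP blocks.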

The second barrier – and the caveat on that OpenCL flow (until now) – has been floating-point math. Since the DSP blocks on FPGAs have always been fixed-point, floating point arithmetic required going outside the DSP blocks and implementing the logic in FPGA LUT fabric. While this was still a “hardware” implementation, it was much less power- and logic-efficient than a custom-designed hardware floating-point unit. With this announcement, Altera has plugged that gap – bringing fully-optimized hardened single-precision floating point to their DSP blocks.

Apparently, these nifty hardened floating point units have already been hiding in Altera’s Arria 10 FPGAs – just waiting for support in the design tools. When design tool support is turned on, Altera’s 20nm, TSMC-fabbed Arria 10 FPGAs will suddenly be capable of up to 1,500 GFLOPS. This performance can be tapped via the OpenCL flow, the DSP Builder flow, or even old-school – with “FP Megafunctions” instantiated in your HDL code.

Where this gets really interesting, however, is with Altera’s upcoming Stratix 10 family – based on Intel’s 14nm Tri-Gate (FinFET) process. With Stratix 10, Altera claims they’ll have up to ten teraFLOPS of performance in a single FPGA. That’s staggering by any standard, and we should have it sometime in 2015.

It is perhaps appropriate at this point to debunk some of the derisive rumors being manufactured and spread by one of the industry’s less-reputable pay-for-play shill blogs. There is absolutely no evidence to support rumors of Altera leaving Intel and going back to TSMC for Stratix 10. On the contrary, at this moment, Altera has working test chips in house that were fabricated with Intel’s 14nm Tri-Gate process. Altera is using these test chips to validate high-speed transceivers, digital logic, and hard-IP blocks (perhaps, even hardened floating-point DSP blocks, although the company hasn’t shared that specifically). Now, maybe this is all innocent and the bloggers in question were simply “confused” because Altera is still very actively partnering with TSMC as well – on the aforementioned 20nm Arria 10 line. Or, perhaps, Altera and Intel didn’t pony up for protection from the blogger mob, so they got kneecapped with some vicious and baseless rumors. As of this writing, however, Altera and Intel are still working hard together on Stratix 10 with 14nm Tri-Gate technology – and apparently it is coming along quite nicely.

Hardening the floating point processing has the obvious advantages one would expect, plus some less-obvious ones. Of course, optimized floating-point hardware is much faster than floating-point units built from FPGA LUT fabric. Also of course, power consumption is greatly reduced. Less obvious is the fact that, since Altera has just freed up all those FPGA logic cells that were doing floating point before (a great number of them, it turns out), we are suddenly gifted with a huge helping of extra FPGA fabric. In other words, if you were using your old FPGA for floating point, that FPGA just got a whole lot bigger.

Following on from that advantage, the old floating-point modules were some of the most difficult parts of many designs to route successfully and bring to timing closure. With the hardened floating point blocks, those structures no longer need to be routed through the fabric, and those paths no longer suffer the agony of timing closure. Your design-tool drama and runtimes just took a big turn in the right direction.

There is an industry significance to this announcement that is also not obvious. For decades now, FPGA companies have dueled it out for their slices of the lucrative communications infrastructure pie. While that market has always been the leading revenue generator for FPGAs, the technology is clearly applicable in many other markets and application areas. However, the requirement to have an FPGA expert on the team has thrown a wet blanket on many of those new-market opportunities. High-performance computing is clearly one of those under-served, high-potential applications for FPGAs. If FPGAs can get past a critical proof-point, a whole new market opens up. That proof-point: a software engineer can write code in a language like OpenCL, target it to an FPGA as easily as to a GPU, and get some combination of faster performance, lower cost, and lower power consumption. When that happens, FPGAs have a new market, and Altera is competing with companies like Nvidia rather than with its traditional rivals.

You can get started designing now with Arria 10 using any of Altera’s supported design flows. Today, your code will map to soft-core floating-point units implemented in the FPGA fabric. In the second half of this year, when Altera turns on hardened floating point support, your same design should automatically re-map to take advantage of the new hardware. Then, when Stratix 10 comes out next year, you’ll be ready to really turn up the boost. Altera says they have pin-compatible versions of Arria 10 and Stratix 10, so that migration step should be pretty seamless as well.
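As a rough sketch of why that retargeting story is plausible, consider the host side. In a typical OpenCL flow, the kernel source and nearly all of the host code are target-independent; the usual point of divergence is program creation, because FPGA flows compile kernels offline into a binary while GPUs compile source at run time. The helper below is hypothetical – the file name and structure are invented for illustration, and error checking is omitted for brevity:

```c
/* Hypothetical sketch of the one spot where GPU and FPGA OpenCL host
   code typically diverge: program creation. The file name and helper
   are illustrative; error checking is omitted for brevity. */
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

cl_program load_program(cl_context ctx, cl_device_id dev, int target_is_fpga)
{
    cl_int err;
    cl_program prog;

    if (target_is_fpga) {
        /* FPGA OpenCL flows generally compile kernels ahead of time
           (a lengthy place-and-route job), so the host loads a
           precompiled binary instead of source. */
        FILE *f = fopen("kernels.aocx", "rb");   /* illustrative name */
        fseek(f, 0, SEEK_END);
        size_t len = (size_t)ftell(f);
        rewind(f);
        unsigned char *bin = malloc(len);
        fread(bin, 1, len, f);
        fclose(f);
        prog = clCreateProgramWithBinary(ctx, 1, &dev, &len,
                                         (const unsigned char **)&bin,
                                         NULL, &err);
        free(bin);   /* the runtime copies the binary */
    } else {
        /* GPUs build the very same kernel source at run time. */
        const char *src =
            "__kernel void saxpy(const float a,\n"
            "                    __global const float *x,\n"
            "                    __global float *y)\n"
            "{ size_t i = get_global_id(0); y[i] = a*x[i] + y[i]; }\n";
        prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    }
    clBuildProgram(prog, 1, &dev, "", NULL, NULL);
    return prog;
}
```

Everything else – buffers, queues, kernel launches – stays the same, which is what makes the “write once, retarget later” promise credible.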
