
Nallatech exhibiting at International SuperComputing 17

Nallatech, a Molex company, will showcase next generation OpenCL-programmable FPGA accelerator products for datacentre and cloud service applications at ISC17 being held in Frankfurt, Germany, June 19-23, 2017. The annual exhibition represents one of the largest gatherings of high performance computing (HPC) industry leaders and experts displaying the latest innovations.
 
ISC17 Announcement
Nallatech’s ISC17 booth, C-1250, will feature hardware and software products plus design services for customers building scale-out datacentres and cloud-based services leveraging FPGA technology.
“International Supercomputing is the perfect event for Nallatech to introduce the ‘520’ – our next generation energy-efficient accelerator product featuring Intel Stratix 10 FPGAs,” said Craig Petrie, VP Business Development FPGA Solutions, Nallatech. “The OpenCL-programmable 520 delivers twice the core performance over previous-generation FPGAs with up to 70% lower power consumption. This unprecedented price-performance, coupled with Nallatech’s application expertise and extensive manufacturing capabilities, allows our customers to benchmark and deploy large-scale FPGA solutions with minimal cost and risk.”
Advancements in architecture and high-level programming tools are opening doors for new FPGA use cases. For more information on how Nallatech streamlines FPGA integration and supports customers in the transition from prototyping to production, please visit www.nallatech.com
About Nallatech
Nallatech is a leading supplier of FPGA accelerated computing solutions. Since 1993, Nallatech has provided hardware, software and design services to enable customers’ success in applications including high performance computing, network processing, and real-time embedded computing.

About ISC17
The ISC17 High Performance exhibition features the largest collection of HPC vendors, universities, and research organizations annually assembled in Europe. Together, they represent the innovation, diversity and creativity that are the hallmarks of the global HPC community. Having them all available on the same exhibition floor presents a unique opportunity for users to survey the HPC landscape and for vendors to display their latest and greatest wares.

Nallatech Officially Joins Dell Technology Partner Program


Nallatech and Dell partner to offer High-Performance Computing in the Datacenter


CAMARILLO, CA – March 11, 2017 – Nallatech, a Molex company, recently announced its official membership in the Dell Technology Partner Program. The partnership will help customers accelerate the datacenter more efficiently than ever before.

Nallatech will continue to integrate FPGA accelerators in Dell servers, but this new partnership will help sway those on the fence who have not completely bought into this model of computing. FPGA experts and newcomers alike will be empowered to utilize this mainstream method of FPGA algorithm development and deployment in the datacenter. With this ready-to-use solution, FPGA programmers can focus completely on developing their own massively parallel and compute intensive applications while reducing power consumption and total cost of ownership.

Read more about Nallatech’s new partnership with Dell – Click here

The Dell Technology Partner Program
Nallatech is a Dell Technology Partner. The 385A and 510T FPGA OpenCL Accelerators are certified by Dell to run on Dell platforms that are specified in the above technical overview.

About Nallatech:
Nallatech, a Molex company, is a leading supplier of accelerated computing solutions. Nallatech has deployed several of the world’s largest FPGA hybrid compute clusters, and is focused on delivering scalable solutions that deliver high performance per watt, per dollar. www.nallatech.com.

About Molex, LLC
Molex brings together innovation and technology to deliver electronic solutions to customers worldwide. With a presence in more than 40 countries, Molex offers a full suite of solutions and services for many markets, including data communications, consumer electronics, industrial, automotive, commercial vehicle and medical. For more information, please visit http://www.molex.com.


SuperComputing 2016 – Nallatech Exhibiting

Nallatech Drives FPGA Accelerator Revolution at Supercomputing 2016

Visit booth 2214 for Deep Learning CNN demonstration using latest generation of FPGA accelerators and tools


LISLE, IL – November 14, 2016 – Nallatech, a Molex company, will showcase FPGA solutions for high-performance computing (HPC), network acceleration and data analytics at the Supercomputing 2016 (SC16) Conference and Exhibition, November 14-17 in Salt Lake City.

“Major datacenter customers are now deploying FPGAs to achieve break-through application performance and energy-efficiency above and beyond what can be achieved using conventional CPU and GPU platforms,” said Allan Cantle, President and Founder, Nallatech. “In-depth design, application expertise and process knowledge combined with extensive manufacturing capabilities allow for quick deployments of large-scale FPGA solutions.”

A demonstration at the Nallatech booth 2214 will showcase FPGA acceleration of Convolutional Neural Networks (CNN) object classification by using a low profile Nallatech 385A™ PCIe accelerator card with a discrete Intel Arria 10 FPGA accelerator programmed using Intel’s OpenCL Software Development Kit. Built on the BVLC Caffe deep learning framework, an FPGA interface and IP accelerate processing intensive components of the algorithm. Nallatech IP is capable of processing an image through the AlexNet neural network in nine milliseconds. The Arria10-based 385A™ board has the capacity to process six CNN images in parallel allowing classification of 660 images per second.

Additionally, Nallatech will display a range of leading-edge technologies at SC16—

  • 385A™ FPGA Accelerator Card — A low profile, server-qualified FPGA card capable of accelerating energy-efficient datacenter applications. Two independent banks of SDRAM memory and two optical network ports complete the balanced architecture capable of both co-processing and latency-critical 1G/10G/40G streaming applications.
  • 385A-SoC™ System on Chip FPGA Accelerator Card – A powerful computing and I/O platform for SoC FPGA and ARM-based development and deployment across a range of application areas including HPC, image processing and network analytics. The 385A-SoC™ is capable of being used in “stand-alone” mode running Linux and software stacks on the embedded ARM processors allowing the accelerator to be used without a host server for ultimate Size, Weight and Power (SWAP) performance.
  • 510T™ Compute Acceleration Card – FPGA co-processor designed to deliver ultimate performance per watt for compute-intensive datacenter applications. A GPU-sized 16-lane PCIe Gen 3 card, the 510T™ features two Intel Arria 10 FPGAs delivering up to sixteen times the performance of the previous generation. Applications can achieve a total sustained performance of up to 3 TFlops. The 510T™ card is available with almost 300GByte/sec of peak external memory bandwidth configured as eight independent banks of DDR4. This combination, plus the FPGA’s on-chip memory bandwidth of 14.4TBytes/sec, permits dramatic new levels of performance per watt for memory-bound applications.
  • Nallatech® FPGA Accelerated Compute Node® – Develop and deploy quickly with minimal risk using a Nallatech FPGA accelerator, tools and IP delivered pre-integrated in a server of your choice.

About Nallatech:
Nallatech, a Molex company, is a leading supplier of accelerated computing solutions. Nallatech has deployed several of the world’s largest FPGA hybrid compute clusters, and is focused on delivering scalable solutions that deliver high performance per watt, per dollar. www.nallatech.com.

About Molex, LLC
Molex brings together innovation and technology to deliver electronic solutions to customers worldwide. With a presence in more than 40 countries, Molex offers a full suite of solutions and services for many markets, including data communications, consumer electronics, industrial, automotive, commercial vehicle and medical. For more information, please visit http://www.molex.com.


Microsoft Bets its Future on a Reprogrammable Computer Chip

Prototype hardware for Project Catapult, a six-year effort to rebuild Microsoft’s online empire for the next wave of AI—and more. CLAYTON COTTERELL FOR WIRED

IT WAS DECEMBER 2012, and Doug Burger was standing in front of Steve Ballmer, trying to predict the future.

Ballmer, the big, bald, boisterous CEO of Microsoft, sat in the lecture room on the ground floor of Building 99, home base for the company’s blue-sky R&D lab just outside Seattle. The tables curved around the outside of the room in a U-shape, and Ballmer was surrounded by his top lieutenants, his laptop open. Burger, a computer chip researcher who had joined the company four years earlier, was pitching a new idea to the execs. He called it Project Catapult.

 


Doug Burger. CLAYTON COTTERELL FOR WIRED

The tech world, Burger explained, was moving into a new orbit. In the future, a few giant Internet companies would operate a few giant Internet services so complex and so different from what came before that these companies would have to build a whole new architecture to run them. They would create not just the software driving these services, but the hardware, including servers and networking gear. Project Catapult would equip all of Microsoft’s servers—millions of them—with specialized chips that the company could reprogram for particular tasks.

But before Burger could even get to the part about the chips, Ballmer looked up from his laptop. When he visited Microsoft Research, Ballmer said, he expected updates on R&D, not a strategy briefing. “He just started grilling me,” Burger says. Microsoft had spent 40 years building PC software like Windows, Word, and Excel. It was only just finding its feet on the Internet. And it certainly didn’t have the tools and the engineers needed to program computer chips—a task that’s difficult, time consuming, expensive, and kind of weird. Microsoft programming computer chips was like Coca Cola making shark fin soup.


The current incarnation of Project Catapult. CLAYTON COTTERELL FOR WIRED

Burger—trim, only slightly bald, and calmly analytical, like so many good engineers—pushed back. He told Ballmer that companies like Google and Amazon were already moving in this direction. He said the world’s hardware makers wouldn’t provide what Microsoft needed to run its online services. He said that Microsoft would fall behind if it didn’t build its own hardware. Ballmer wasn’t buying it. But after awhile, another voice joined the discussion. This was Qi Lu, who runs Bing, Microsoft’s search engine. Lu’s team had been talking to Burger about reprogrammable computer chips for almost two years. Project Catapult was more than possible, Lu said: His team had already started.

Today, the programmable computer chips that Burger and Lu believed would transform the world—called field programmable gate arrays—are here. FPGAs already underpin Bing, and in the coming weeks, they will drive new search algorithms based on deep neural networks—artificial intelligence modeled on the structure of the human brain—executing this AI several orders of magnitude faster than ordinary chips could. As in, 23 milliseconds instead of four seconds of nothing on your screen. FPGAs also drive Azure, the company’s cloud computing service. And in the coming years, almost every new Microsoft server will include an FPGA. That’s millions of machines across the globe. “This gives us massive capacity and enormous flexibility, and the economics work,” Burger says. “This is now Microsoft’s standard, worldwide architecture.”


Catapult team members Adrian Caulfield, Eric Chung, Doug Burger, and Andrew Putnam. CLAYTON COTTERELL FOR WIRED

This isn’t just Bing playing catch-up with Google. Project Catapult signals a change in how global systems will operate in the future. From Amazon in the US to Baidu in China, all the Internet giants are supplementing their standard server chips—central processing units, or CPUs—with alternative silicon that can keep pace with the rapid changes in AI. Microsoft now spends between $5 and $6 billion a year for the hardware needed to run its online empire. So this kind of work is “no longer just research,” says Satya Nadella, who took over as Microsoft’s CEO in 2014. “It’s an essential priority.” That’s what Burger was trying to explain in Building 99. And it’s what drove him and his team to overcome years of setbacks, redesigns, and institutional entropy to deliver a new kind of global supercomputer.

A Brand New, Very Old Kind of Computer Chip
In December of 2010, Microsoft researcher Andrew Putnam had left Seattle for the holidays and returned home to Colorado Springs. Two days before Christmas, he still hadn’t started shopping. As he drove to the mall, his phone rang. It was Burger, his boss. Burger was going to meet with Bing execs right after the holiday, and he needed a design for hardware that could run Bing’s machine learning algorithms on FPGAs.

Putnam pulled into the nearest Starbucks and drew up the plans. It took him about five hours, and he still had time for shopping.

Burger, 47, and Putnam, 39, are both former academics. Burger spent nine years as a professor of computer science at the University of Texas, Austin, where he specialized in microprocessors and designed a new kind of chip called EDGE. Putnam had worked for five years as a researcher at the University of Washington, where he experimented with FPGAs, programmable chips that had been around for decades but were mostly used as a way of prototyping other processors. Burger brought Putnam to Microsoft in 2009, where they started exploring the idea that these chips could actually accelerate online services.


Project Catapult Version 1, or V1, the hardware that Doug Burger and team tested in a data center on Microsoft’s Seattle campus. CLAYTON COTTERELL FOR WIRED

Even their boss didn’t buy it. “Every two years, FPGAs are ‘finally going to arrive,’” says Microsoft Research vice president Peter Lee, who oversees Burger’s group. “So, like any reasonable person, I kind of rolled my eyes when this was pitched.” But Burger and his team believed this old idea’s time had come, and Bing was the perfect test case.

Microsoft’s search engine is a single online service that runs across thousands of machines. Each machine is driven by a CPU, and though companies like Intel continue to improve them, these chips aren’t keeping pace with advances in software, in large part because of the new wave in artificial intelligence. Services like Bing have outstripped Moore’s Law, the canonical notion that the number of transistors in a processor doubles every 18 months. Turns out, you can’t just throw more CPUs at the problem.

But on the other hand, it’s generally too expensive to create specialized, purpose-built chips for every new problem. FPGAs bridge the gap. They let engineers build chips that are faster and less energy-hungry than an assembly-line, general-purpose CPU, but customizable so they handle the new problems of ever-shifting technologies and business models.

At that post-holiday meeting, Burger pitched Bing’s execs on FPGAs as a low-power way of accelerating searches. The execs were noncommittal. So over the next several months, Burger and team took Putnam’s Christmas sketch and built a prototype, showing that it could run Bing’s machine learning algorithms about 100 times faster. “That’s when they really got interested,” says Jim Larus, another member of the team back then who’s now a dean at Switzerland’s École Polytechnique Fédérale in Lausanne. “They also started giving us a really hard time.”

The prototype was a dedicated box with six FPGAs, shared by a rack full of servers. If the box went on the fritz, or if the machines needed more than six FPGAs—increasingly likely given the complexity of the machine learning models—all those machines were out of luck. Bing’s engineers hated it. “They were right,” Larus says.

So Burger’s team spent many more months building a second prototype. This one was a circuit board that plugged into each server and included only one FPGA. But it also connected to all the other FPGA boards on all the other servers, creating a giant pool of programmable chips that any Bing machine could tap into.

That was the prototype that got Qi Lu on board. He gave Burger the money to build and test over 1,600 servers equipped with FPGAs. The team spent six months building the hardware with help from manufacturers in China and Taiwan, and they installed the first rack in an experimental data center on the Microsoft campus. Then, one night, the fire suppression system went off by accident. They spent three days getting the rack back in shape—but it still worked.

Over several months in 2013 and 2014, the test showed that Bing’s “decision tree” machine-learning algorithms ran about 40 times faster with the new chips. By the summer of 2014, Microsoft was publicly saying it would soon move this hardware into its live Bing data centers. And then the company put the brakes on.

Searching for More Than Bing
Bing dominated Microsoft’s online ambitions in the early part of the decade, but by 2015 the company had two other massive online services: the business productivity suite Office 365 and the cloud computing service Microsoft Azure. And like all of their competitors, Microsoft executives realized that the only efficient way of running a growing online empire is to run all services on the same foundation. If Project Catapult was going to transform Microsoft, it couldn’t be exclusive to Bing. It had to work inside Azure and Office 365, too.

The problem was, Azure executives didn’t care about accelerating machine learning. They needed help with networking. The traffic bouncing around Azure’s data centers was growing so fast, the service’s CPUs couldn’t keep pace. Eventually, people like Mark Russinovich, the chief architect on Azure, saw that Catapult could help with this too—but not the way it was designed for Bing. His team needed programmable chips right where each server connected to the primary network, so they could process all that traffic before it even got to the server.


The first prototype of the FPGA architecture was a single box shared by a rack of servers (Version 0). Then the team switched to giving individual servers their own FPGAs (Version 1). And then they put the chips between the servers and the overall network (Version 2).

So the FPGA gang had to rebuild the hardware again. With this third prototype, the chips would sit at the edge of each server, plugging directly into the network, while still creating a pool of FPGAs that was available for any machine to tap into. That started to look like something that would work for Office 365, too. Project Catapult was ready to go live at last.

Larus describes the many redesigns as an extended nightmare—not because they had to build new hardware, but because they had to reprogram the FPGAs every time. “That is just horrible, much worse than programming software,” he says. “Much more difficult to write. Much more difficult to get correct.” It’s finicky work, like trying to change tiny logic gates on the chip.

Now that the final hardware is in place, Microsoft faces that same challenge every time it reprograms these chips. “It’s a very different way of seeing the world, of thinking about the world,” Larus says. But the Catapult hardware costs less than 30 percent of everything else in the server, consumes less than 10 percent of the power, and processes data twice as fast as the company could without it.

The rollout is massive. Microsoft Azure uses these programmable chips to route data. On Bing, which handles an estimated 20 percent of the worldwide search market on desktop machines and about 6 percent on mobile phones, the chips are facilitating the move to the new breed of AI: deep neural nets. And according to one Microsoft employee, Office 365 is moving toward using FPGAs for encryption and compression as well as machine learning—for all of its 23.1 million users. Eventually, Burger says, these chips will power all Microsoft services.

Wait—This Actually Works?
“It still stuns me,” says Peter Lee, “that we got the company to do this.” Lee oversees an organization inside Microsoft Research called NExT, short for New Experiences and Technologies. After taking over as CEO, Nadella personally pushed for the creation of this new organization, and it represents a significant shift from the 10-year reign of Ballmer. It aims to foster research that can see the light of day sooner rather than later—that can change the course of Microsoft now rather than years from now. Project Catapult is a prime example. And it is part of a much larger change across the industry. “The leaps ahead,” Burger says, “are coming from non-CPU technologies.”


Peter Lee. CLAYTON COTTERELL FOR WIRED

All the Internet giants, including Microsoft, now supplement their CPUs with graphics processing units, chips designed to render images for games and other highly visual applications. When these companies train their neural networks to, for example, recognize faces in photos—feeding in millions and millions of pictures—GPUs handle much of the calculation. Some giants like Microsoft are also using alternative silicon to execute their neural networks after training. And even though it’s crazily expensive to custom-build chips, Google has gone so far as to design its own processor for executing neural nets, the tensor processing unit.

With its TPUs, Google sacrifices long-term flexibility for speed. It wants to, say, eliminate any delay when recognizing commands spoken into smartphones. The trouble is that if its neural networking models change, Google must build a new chip. But with FPGAs, Microsoft is playing a longer game. Though an FPGA isn’t as fast as Google’s custom build, Microsoft can reprogram the silicon as needs change. The company can reprogram not only for new AI models, but for just about any task. And if one of those designs seems likely to be useful for years to come, Microsoft can always take the FPGA programming and build a dedicated chip.

 

 

 


A newer version of the final hardware, V2, a card that slots into the end of each Microsoft server and connects directly to the network. CLAYTON COTTERELL FOR WIRED

Microsoft’s services are so large, and they use so many FPGAs, that they’re shifting the worldwide chip market. The FPGAs come from a company called Altera, and Intel executive vice president Diane Bryant tells me that Microsoft is why Intel acquired Altera last summer—a deal worth $16.7 billion, the largest acquisition in the history of the largest chipmaker on Earth. By 2020, she says, a third of all servers inside all the major cloud computing companies will include FPGAs.

It’s a typical tangle of tech acronyms. CPUs. GPUs. TPUs. FPGAs. But it’s the subtext that matters. With cloud computing, companies like Microsoft and Google and Amazon are driving so much of the world’s technology that those alternative chips will drive the wider universe of apps and online services. Lee says that Project Catapult will allow Microsoft to continue expanding the powers of its global supercomputer until the year 2030. After that, he says, the company can move toward quantum computing.

Later, when we talk on the phone, Nadella tells me much the same thing. They’re reading from the same Microsoft script, touting a quantum-enabled future of ultrafast computers. Considering how hard it is to build a quantum machine, this seems like a pipe dream. But just a few years ago, so did Project Catapult.

Correction: This story originally implied that the Hololens headset was part of Microsoft’s NExT organization. It was not.

 


FPGA Acceleration of Convolutional Neural Networks

DOWNLOAD WHITEPAPER: FPGA Accelerated CNN

Introduction – CNN – Convolutional Neural Network

Convolutional Neural Networks (CNNs) have been shown to be extremely effective at complex image recognition problems. This white paper discusses how these networks can be accelerated using FPGA accelerator products from Nallatech, programmed using the Altera OpenCL Software Development Kit. It then describes how image categorization performance can be significantly improved by reducing computation precision: each reduction in precision allows the FPGA accelerator to process more images per second.

Caffe Integration

Caffe is a deep learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center and by community contributors.

The Caffe framework uses an XML interface to describe the different processing layers required for a particular CNN. By implementing different combinations of layers a user is able to quickly create a new network topology for their given requirements.

The most commonly used of these layers are:
• Convolution: The convolution layer convolves the input image with a set of learnable filters, each producing one feature map in the output image.
• Pooling: Max-pooling partitions the input image into a set of non-overlapping rectangles and, for each such sub-region, outputs the maximum value.
• Rectified-Linear: Given an input value x, the ReLU layer computes the output as x if x > 0 and negative_slope * x if x <= 0.
• InnerProduct/Fully Connected: The image is treated as a single vector, with each input point contributing to each point of the new output vector.

By porting these four layers to the FPGA, the vast majority of forward-processing networks can be implemented with FPGA acceleration using the Caffe framework.
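For readers who want a concrete reference for what these four layers compute, the short NumPy sketch below mirrors their arithmetic. It is purely illustrative: the shapes and function names are invented here and have nothing to do with Nallatech’s OpenCL kernels or the Caffe source.

```python
# Minimal NumPy reference of the four Caffe layers discussed above.
# Illustrative only -- it mirrors the layer math, not Nallatech's OpenCL kernels.
import numpy as np

def convolution(image, filters, stride=1):
    """image: (C, H, W), filters: (K, C, fh, fw) -> output (K, H_out, W_out)."""
    C, H, W = image.shape
    K, _, fh, fw = filters.shape
    H_out = (H - fh) // stride + 1
    W_out = (W - fw) // stride + 1
    out = np.zeros((K, H_out, W_out))
    for k in range(K):
        for y in range(H_out):
            for x in range(W_out):
                patch = image[:, y*stride:y*stride+fh, x*stride:x*stride+fw]
                out[k, y, x] = np.sum(patch * filters[k])
    return out

def max_pool(fmap, size=2, stride=2):
    """Max-pooling: the maximum of each window, per feature map."""
    K, H, W = fmap.shape
    H_out = (H - size) // stride + 1
    W_out = (W - size) // stride + 1
    out = np.zeros((K, H_out, W_out))
    for y in range(H_out):
        for x in range(W_out):
            window = fmap[:, y*stride:y*stride+size, x*stride:x*stride+size]
            out[:, y, x] = window.max(axis=(1, 2))
    return out

def relu(x, negative_slope=0.0):
    """Rectified-Linear: x if x > 0, else negative_slope * x."""
    return np.where(x > 0, x, negative_slope * x)

def inner_product(x, weights, bias):
    """Fully connected layer: flatten the input and apply one matrix multiply."""
    return weights @ x.reshape(-1) + bias
```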

Figure 1 : Example illustration of a typical CNN – Convolutional Neural Network

To access the accelerated FPGA version of the code, the user need only change the description of the CNN layer in the Caffe XML network description file to target the FPGA equivalent.

AlexNet

Figure 2 : AlexNet CNN – Convolutional Neural Network

AlexNet is a well known and widely used network, with freely available trained datasets and benchmarks. This paper discusses an FPGA implementation targeted at the AlexNet CNN; however, the approach used here applies equally well to other networks.

Figure 2 illustrates the different network layers required by the AlexNet CNN. There are 5 convolution and 3 fully connected layers. These layers occupy > 99% of the processing time for this network. There are 3 different filter sizes for the different convolution layers: 11×11, 5×5 and 3×3. Creating a separate, fully optimized engine for each convolution layer would be inefficient, because the computational time of each layer differs depending upon the number of filters applied, the number of input and output features processed, and the size of the input images. Instead, by increasing the resource applied to the more compute intensive layers, each layer can be balanced to complete in the same amount of time. It is therefore possible to create a pipelined process that can have several images in flight at any one time, maximizing the efficiency of the logic used, i.e. most processing elements are busy most of the time.
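A small software sketch can make this balancing idea concrete: allocate parallel processing elements to each convolution layer in proportion to its multiply-accumulate count so that every stage finishes in roughly the same number of cycles. The AlexNet layer shapes below are the published ones; the total parallelism budget is an assumed figure chosen only for illustration.

```python
# Hedged sketch: proportional allocation of parallelism so pipeline stages balance.
# MAC counts use AlexNet's published convolution shapes; TOTAL_PES (the budget of
# parallel multipliers) is an assumed figure for illustration only.

# (layer, output_h, output_w, output_channels, input_channels, filter_size)
ALEXNET_CONV = [
    ("conv1", 55, 55, 96, 3, 11),
    ("conv2", 27, 27, 256, 48, 5),
    ("conv3", 13, 13, 384, 256, 3),
    ("conv4", 13, 13, 384, 192, 3),
    ("conv5", 13, 13, 256, 192, 3),
]

TOTAL_PES = 4096  # assumed number of parallel multipliers available on the FPGA

def macs(h, w, cout, cin, k):
    """Multiply-accumulates needed for one convolution layer."""
    return h * w * cout * cin * k * k

total = sum(macs(*layer[1:]) for layer in ALEXNET_CONV)
for name, h, w, cout, cin, k in ALEXNET_CONV:
    m = macs(h, w, cout, cin, k)
    pes = max(1, round(TOTAL_PES * m / total))   # resource in proportion to work
    cycles = m / pes                             # roughly equal for every layer
    print(f"{name}: {m/1e6:6.1f} M MACs -> {pes:4d} PEs -> ~{cycles/1e6:.2f} M cycles")
```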


Table 1 : ImageNet layer computation requirements

Table 1 shows the computation required for each layer of the ImageNet network. From this table it can be seen that the 5×5 convolution layer requires more compute than the other layers. Therefore, more FPGA processing logic must be allocated to this layer for it to be balanced with the other layers.

The inner product layers have an n-to-n mapping, requiring a unique coefficient for each multiply-add. Inner product layers usually require significantly less compute than convolutional layers and therefore need less parallelization of logic. In this scenario it makes sense to move the inner product layers onto the host CPU, leaving the FPGA to focus on convolutions.

FPGA logic areas

FPGA devices have two processing regions: DSP and ALU logic. The DSP logic is dedicated logic for multiply or multiply-add operators, because implementing large (18×18 bit) multiplications in ALU logic is costly. Given how common multiplications are in DSP operations, FPGA vendors provide dedicated logic for this purpose. Altera have gone a step further and allow the DSP logic to be reconfigured to perform floating point operations. To increase the performance of CNN processing it is necessary to increase the number of multiplications that can be implemented in the FPGA. One approach is to decrease the bit accuracy.

Bit Accuracy

Most CNN implementations use floating point precision for the different layer calculations. For a CPU or GPGPU implementation this is not an issue, as the floating point IP is a fixed part of the chip architecture. For FPGAs the logic elements are not fixed. The Arria 10 and Stratix 10 devices from Altera have embedded floating point DSP blocks that can also be used for fixed point multiplications. Each DSP component can in fact be used as two separate 18×19 bit multipliers. By performing convolution using 18 bit fixed point logic, the number of available operators doubles compared to single precision floating point.

Figure 3 : Arria 10 floating point DSP configuration

If reduced precision floating point processing is acceptable, it is possible to use half precision. This requires additional logic from the FPGA fabric, but doubles the number of floating point calculations possible, assuming the lower bit precision is still adequate.

One of the key advantages of the pipeline approach described in this white paper is the ability to vary the accuracy at different stages of the pipeline. Therefore, resources are only used where necessary, increasing the efficiency of the design.

Figure 4 : Arria 10 fixed point DSP configuration

Depending upon the CNN application’s tolerance, the bit precision can be reduced further still. If the bit width of the multiplications can be reduced to 10 bits or less (20 bit output), the multiplication can then be performed efficiently using just the FPGA ALU logic. This doubles the number of multiplications possible compared to just using the FPGA DSP logic. Some networks may be tolerant to even lower bit precision; the FPGA can handle all precisions down to a single bit if necessary.

For the CNN layers used by AlexNet it was ascertained that 10 bit coefficient data was the lowest precision that could be used in a simple fixed point implementation whilst maintaining less than a 1% error versus a single precision floating point implementation.
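This kind of precision study can be reproduced approximately in software by quantizing the coefficients to an n-bit fixed point format and comparing the layer output against the single precision result. The sketch below does exactly that with random stand-in data; the sub-1% figure quoted above comes from Nallatech’s own analysis of the real trained weights, so treat the sketch as a method illustration only.

```python
# Hedged sketch: estimate the error introduced by n-bit fixed point coefficients
# relative to single precision floating point, using random data in place of
# real trained AlexNet weights.
import numpy as np

def quantize(values, bits):
    """Symmetric fixed point quantization to 'bits' total bits (1 sign bit)."""
    scale = (2 ** (bits - 1) - 1) / np.max(np.abs(values))
    return np.round(values * scale) / scale

rng = np.random.default_rng(0)
# Stand-ins roughly shaped like AlexNet's 5x5 layer: 256 filters over 48 features.
weights = rng.standard_normal((256, 48, 5, 5)).astype(np.float32)
patches = rng.standard_normal((1000, 48, 5, 5)).astype(np.float32)

# float32 "golden" result: one dot product per (patch, filter) pair.
reference = np.tensordot(patches, weights, axes=([1, 2, 3], [1, 2, 3]))
for bits in (16, 10, 8, 6):
    approx = np.tensordot(patches, quantize(weights, bits), axes=([1, 2, 3], [1, 2, 3]))
    rel_err = np.linalg.norm(approx - reference) / np.linalg.norm(reference)
    print(f"{bits:2d}-bit coefficients: relative output error {rel_err:.3%}")
```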

CNN convolution layers

Using a sliding window technique, it is possible to create convolution kernels that are extremely light on memory bandwidth.

Figure 5 : Sliding window for 3×3 convolution

Figure 5 illustrates how data is cached in FPGA memory allowing each pixel to be reused multiple times. The amount of data reuse is proportional to the size of the convolution kernel.
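The sketch below is a behavioural software model of that sliding window: only one new pixel is fetched per step, while the rest of the 3×3 window comes from row buffers that stand in for on-chip memory. It assumes a single input channel and no padding, and it is not Nallatech’s OpenCL kernel.

```python
# Hedged behavioural model of a sliding-window line buffer for a 3x3 convolution.
# One new pixel is read from "external memory" per step; the rest of the window
# comes from two row buffers held in on-chip memory (single channel, no padding).
from collections import deque
import numpy as np

def sliding_window_conv3x3(image, kernel):
    H, W = image.shape
    rows = deque([np.zeros(W), np.zeros(W)], maxlen=2)  # two previous rows on chip
    window = np.zeros((3, 3))
    out = np.zeros((H - 2, W - 2))
    external_reads = 0
    for y in range(H):
        current = np.zeros(W)
        for x in range(W):
            pixel = image[y, x]          # the only external memory access per step
            external_reads += 1
            current[x] = pixel
            # Shift the 3x3 window left and load a new column from the row buffers.
            window[:, :2] = window[:, 1:]
            window[:, 2] = [rows[0][x], rows[1][x], pixel]
            if y >= 2 and x >= 2:
                out[y - 2, x - 2] = np.sum(window * kernel)
        rows.append(current)
    assert external_reads == H * W       # each pixel fetched exactly once
    return out

# Usage: matches a direct convolution while reading each pixel only once.
img = np.arange(36, dtype=float).reshape(6, 6)
k = np.ones((3, 3)) / 9.0
print(sliding_window_conv3x3(img, k))
```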

As each input layer influences all output layers in a CNN convolution layer, it is possible to process multiple input layers simultaneously. This would increase the external memory bandwidth required for loading layers. To mitigate the increase, all data except for coefficients is stored in local M20K memory on the FPGA device. The amount of on-chip memory on the device limits the number of CNN layers that can be implemented.

Figure 6 : OpenCL Global Memory Bandwidth (AlexNet)

Most CNN features will fit within a single M20K memory, and with thousands of M20Ks embedded in the FPGA fabric, the total memory bandwidth available for processing convolution features in parallel is on the order of tens of terabytes per second.

Figure 7 : Arria 10 GX1150 / Stratix 10 GX2800 resources

Depending upon the amount of M20K resource available, it is not always possible to fit a complete network on a single FPGA. In this situation, multiple FPGAs can be connected in series using high speed serial interconnects, allowing the network pipeline to be extended until sufficient resource is available.
A key advantage of this approach is that it does not rely on batching to maximize performance; the latency is therefore very low, which is important for latency critical applications.

Figure 8 : Extending a CNN Network Over Multiple FPGAs

Balancing the layers so that each takes the same amount of time requires adjusting the number of parallel input layers implemented and the number of pixels processed in parallel.

Figure 9 : Resources for 5×5 convolution layer of AlexNet

Figure 9 lists the resources required for the 5×5 convolution layer of AlexNet with 48 parallel kernels, for both a single precision and a 16 bit fixed point version on an Intel Arria 10 FPGA. The numbers include the OpenCL board logic, but illustrate the benefit lower precision has on resource usage.

Fully Connected Layer
Processing of a fully connected layer requires a unique coefficient for each element and therefore quickly becomes memory bound with increasing parallelism. The amount of parallelism required to keep pace with the convolutional layers would quickly saturate the FPGA’s off chip memory; it is therefore proposed that this stage of the network is either batched or pruned.

As the number of elements for an inner product layer is small, the amount of storage required for batching is small versus the storage required for the convolution layers. Batching then allows the same coefficient to be used for each batched input, reducing the external memory bandwidth.

Pruning works by studying the input data and ignoring values below a threshold. As fully connected layers are placed at the later stages of a CNN network, many possible features have already been eliminated. Therefore, pruning can significantly reduce the amount of work required.
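The sketch below illustrates both strategies on a fully connected layer with random stand-in weights: batching turns the layer into a single matrix multiply so each coefficient fetch serves the whole batch, and pruning drops input activations below a threshold before the multiply. The shapes and the threshold value are assumptions for illustration, not figures from the paper.

```python
# Hedged sketch of the two fully connected strategies described above, using
# random data in place of trained AlexNet weights.
import numpy as np

rng = np.random.default_rng(1)
weights = rng.standard_normal((4096, 9216)).astype(np.float32)  # fc6-like shape
bias = rng.standard_normal(4096).astype(np.float32)

def fc_batched(inputs, weights, bias):
    """Batching: one weight fetch serves every image in the batch (a single GEMM)."""
    return inputs @ weights.T + bias          # inputs: (batch, 9216)

def fc_pruned(x, weights, bias, threshold):
    """Pruning: skip input activations below a threshold, shrinking the work."""
    keep = np.abs(x) >= threshold
    return weights[:, keep] @ x[keep] + bias

batch = rng.standard_normal((8, 9216)).astype(np.float32)
print("batched output shape:", fc_batched(batch, weights, bias).shape)

x = np.maximum(batch[0], 0)                   # post-ReLU activations: ~half are zero
dense = weights @ x + bias
approx = fc_pruned(x, weights, bias, threshold=0.1)
kept = int(np.count_nonzero(np.abs(x) >= 0.1))
err = np.linalg.norm(approx - dense) / np.linalg.norm(dense)
print(f"pruning kept {kept}/{x.size} inputs, relative error {err:.3%}")
```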

Resource
The key resource driver of the network is the amount of on-chip M20K memory available to store the outputs of each layer. This is constant and independent of the amount of parallelism achieved. Extending the network over multiple FPGAs increases the total amount of M20K memory available and therefore the depth of CNN that can be processed.

Conclusion
The unique flexibility of the FPGA fabric allows the logic precision to be adjusted to the minimum that a particular network design requires. By limiting the bit precision of the CNN calculation the number of images that can be processed per second can be significantly increased, improving performance and reducing power.

The non-batching approach of the FPGA implementation allows single frame latency for object recognition, ideal for situations where low latency is crucial, e.g. object avoidance.

Using this approach for AlexNet (single precision for layer 1, then using 16 bit fixed for remaining layers), each image can be processed in ~1.2 milliseconds with a single Arria 10 FPGA, or 0.58 milliseconds with two FPGAs in series.

View All Nallatech FPGA Cards

• FPGA Accelerated Compute Node (FACN) – with up to (4) 520s
• 520 – NEW Compute Accelerator Card with Stratix 10 FPGA
• 510T – Compute Accelerator Card with (2) Arria 10 FPGAs
• 385A – Network Accelerator Card with Arria 10 GX1150 FPGA


Molex Acquires Interconnect Systems, Inc

Acquisition strengthens Molex advanced high performance computing solution offering


LISLE, IL – April 7, 2016 – Molex, a leading global manufacturer of electronic solutions, announced today the acquisition of Interconnect Systems, Inc. (“ISI”), which specializes in the design and manufacture of high density silicon packaging with advanced interconnect technologies.

According to Tim Ruff, senior vice president, Molex, the acquisition enables Molex to offer a wider range of fully integrated solutions to customers worldwide. “We are excited about the unique capabilities and technologies the ISI team brings to Molex. ISI’s proven expertise in high-density chip packaging strengthens our platform for growth in existing markets and opens doors to new opportunities.”

Headquartered in Camarillo, California, ISI delivers advanced packaging and interconnect solutions to top-tier OEMs in a wide range of industries and technology markets, including aerospace & defense, industrial, data storage and networking, telecom, and high performance computing. ISI uses a multi-discipline customized approach to improve solution performance, reduce package size, and expedite time-to-market for customers.

“We are thrilled to join forces with Molex. By combining respective strengths and leveraging their global manufacturing footprint, we can more efficiently and effectively provide customers with advanced technology platforms and top-notch support services, while scaling up to higher volume production,” said Bill Miller, president, ISI.

About Molex, LLC
Molex brings together innovation and technology to deliver electronic solutions to customers worldwide. With a presence in more than 40 countries, Molex offers a full suite of solutions and services for many markets, including data communications, consumer electronics, industrial, automotive, commercial vehicle and medical. For more information, please visit http://www.molex.com.

Molex Resources:

Learn more about Molex at http://www.molex.com
Follow us at www.twitter.com/molexconnectors
Watch our videos at www.youtube.com/molexconnectors
Connect with us at www.facebook.com/molexconnectors
Read our blog at www.connector.com


How FPGA Boards Can Take On GPUs And Knights Landing

March 17, 2016 Timothy Prickett Morgan

Nallatech doesn’t make FPGAs, but it does have several decades of experience turning FPGAs into FPGA boards, including devices and systems that companies can deploy to solve real-world computing problems without having to do the systems integration work themselves.

With the formerly independent Altera, now part of Intel, shipping its Arria 10 FPGAs, Nallatech has engineered a new coprocessor FPGA board that will allow FPGAs to keep pace with current and future Tesla GPU accelerators from Nvidia and “Knights Landing” Xeon Phi processors and coprocessors from Intel. The architectures of the devices share some similarities, and that is no accident because all HPC applications are looking to increase memory bandwidth and find the right mix of compute, memory capacity, and memory bandwidth to provide efficient performance on parallel applications.

Like the Knights Landing Xeon Phi, the new 510T FPGA board uses a mix of standard DDR4 and Hybrid Memory Cube (HMC) memory to provide a mix of high bandwidth, low capacity memory with high capacity, relatively low bandwidth memory to give an overall performance profile that is better than a mix of FPGAs and plain DDR4 together on the same card. (We detailed the Knights Landing architecture a year ago and updated the specs on the chip last fall. The specs on the future “Pascal” Tesla GPU accelerators, such as we know them, are here.)

In the case of the 510T FPGA board from Nallatech, the compute element is a pair of Altera Arria 10 GX 1150 FPGAs, which are etched in 20 nanometer processes from foundry partner Taiwan Semiconductor Manufacturing Corp. The higher-end Stratix 10 FPGAs are made using Intel’s 14 nanometer processes and pack a lot more punch with up to 10 teraflops per device, but they are not available yet. Nallatech is creating FPGA board coprocessors that will use these future FPGAs. But for a lot of workloads, as Nallatech president and founder Allan Cantle explains to The Next Platform, the compute is not as much of an issue as memory bandwidth to feed that compute. Every workload is different, so that is no disrespect to the Stratix 10 devices, but rather a reflection of the key oil and gas customers that Nallatech engaged with to create the 510T FPGA board.

“In reality, these seismic migration algorithms need huge amounts of compute, but fundamentally, they are streaming algorithms and they are memory bound,” says Cantle. “When we looked at this for one of our customers, who was using Tesla K80 GPU accelerators, somewhere between 5 percent and 10 percent of the available floating point performance was actually being used and 100 percent of the memory bandwidth was consumed. That Tesla K80 with dual GPUs has 24 GB of memory and 480 GB/sec of aggregate memory bandwidth across those GPUs, and it has around 8.7 teraflops of peak single precision floating point capability. We have two Arria 10s, which are rated at 1.5 teraflops each, which is just around 3 teraflops total, but I think the practical upper limit is 2 teraflops, and that is just my personal take. But when you look at it, you only need 400 gigaflops to 800 gigaflops, making very efficient use of the FPGA’s available flops, which you cannot do on a GPU.”

The issue, says Cantle, is that the way the GPU implements the streaming algorithm at the heart of the seismic migration application that is used to find oil buried underground, it makes many accesses to the GDDR5 memory in the GPU card, which is what is burning up all of the memory bandwidth. “The GPU consumes its memory bandwidth quite quickly because you have to come off chip the way the math is done,” Cantle continues. “The opportunity with the FPGA board is to make this into a very deep pipeline and to minimize the amount of time you go into global memory.”

Nallatech 510T FPGA board

The trick that Nallatech is using is putting a block of HMC memory between the two FPGAs on the FPGA board, which is a fast, shared memory space that the two FPGAs can actually share and address at the same time. The 510T FPGA board is one of the first compute devices (rather than networking or storage devices) that is implementing HMC memory, which has been co-developed by Micron Technology and Intel, and it is using the second generation of HMC to be precise. (Nallatech did explore first generation HMC memory on FPGA accelerators for unspecified government customers, but this was not commercially available as the 510T FPGA board is.)

In addition to memory bandwidth bottlenecks, seismic applications used in the oil and gas industry also have memory capacity issues. The larger the memory that the compute has access to, the larger the volume (higher number of frequencies) that the seismic simulation can run. With the memory limit on a single GPU, says Cantle, this particular customer was limited to approximately 800 volumes (it is actually a cube). Oil and gas customers would love to be able to do 4K volumes (again cubed), but that would require about 2 TB of memory to do.

So the 510T FPGA board has four ports of DDR4 main memory to supply capacity to store more data to do the larger and more complex seismic analysis, and by ganging up 16 FPGA boards together across hybrid CPU-FPGA nodes, Nallatech can break through that 4K volumes barrier and reach the level of performance that oil and gas companies are looking for.

Here is the block diagram of the 510T FPGA board:

510T FPGA board diagram

The HMC memory comes in 2 GB capacity, with 4 GB being optional, and has separate read and write ports, each of which deliver 30 GB/sec of peak bandwidth per FPGA on the card. The four ports of DDR4 memory that link to the other side of the FPGAs deliver 32 GB of capacity per FPGA (with an option of 64 GB per FPGA) and 85 GB/sec of peak bandwidth. So each card has 290 GB/sec of aggregate bandwidth and 132 GB of memory for the applications to play in.
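As a sanity check, the aggregate figures quoted above appear to combine as follows; this is a back-of-the-envelope reading of the per-FPGA numbers, not a breakdown published by Nallatech.

```python
# Hedged arithmetic: one way the per-card aggregates quoted above appear to add up.
FPGAS_PER_CARD = 2

hmc_bw_per_port = 30                                  # GB/s, separate read and write ports per FPGA
hmc_bw = FPGAS_PER_CARD * 2 * hmc_bw_per_port         # 120 GB/s across both FPGAs
ddr4_bw = FPGAS_PER_CARD * 85                         # 170 GB/s peak DDR4 per card
print("aggregate bandwidth:", hmc_bw + ddr4_bw, "GB/s")    # 290 GB/s

ddr4_capacity = FPGAS_PER_CARD * 64                   # GB, using the 64 GB-per-FPGA option
hmc_capacity = 4                                      # GB, using the optional 4 GB HMC
print("total memory:", ddr4_capacity + hmc_capacity, "GB")  # 132 GB
```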

These FPGA boards slide into a PCI-Express x16 slot, and in fact, Nallatech has worked with server maker Dell to put these into a custom, high-end server that can put four of these FPGA boards and two Xeon E5 processors into a single 1U rack-mounted server. The Nallatech 510T cards cost $13,000 each at list price, and the cost of a server with four of these plus an OpenCL software development kit and the Altera Quartus Prime Pro FPGA design software added in is $60,000.

Speaking very generally, the two-FPGA boards can deliver about 1.3X the performance of the Tesla K80 running the seismic codes at this oil and gas customer in about half the power envelope, says Cantle, and there is a potential upside of 10X performance for customers that have larger volume datasets or who are prepared to optimize their algorithms to leverage the strengths of the FPGA. But Nallatech also knows that FPGAs are more difficult to program than GPUs at this point, and is being practical about the competitive positioning.

“At the end of the day, everyone needs to be a bit realistic here,” says Cantle. “In terms of price/performance, FPGA cards do not sell in the volumes of GPU boards, so we hit a price/performance limit for these types of algorithms. The idea here is to prove that today’s FPGAs are competent at what GPUs are great at. For oil and gas customers, it makes sense for companies to weigh this up. Is it a slam dunk? I can’t say that. But if you are doing bit manipulation problems – compression, encryption, bioinformatics – it is a no brainer that the FPGA is far better – tens of times faster – than the GPU. There will be places where the FPGA will be a slam dunk, and with Intel’s purchase of Altera, their future is certainly bright.”

The thing we observe is that companies will have to look not just at raw compute but at how their models can scale across the various memories in a compute element, across multiple elements lashed together inside a node, and across nodes.

Source: TheNextPlatform


Low Latency Key-Value Store / High Performance Data Center Services

Gateware Defined Networking® (GDN) – Search Low Latency Key-Value Store

Low Latency Key-Value Store (KVS) is an essential service for multiple applications. Telecom directories, Internet Protocol forwarding tables, and de-duplicating storage systems, for example, all need key-value tables to associate data with unique identifiers. In datacenters, high performance KVS tables allow hundreds or thousands of machines to easily share data by simply associating values with keys and allowing client machines to read and write those keys and values over standard high-speed Ethernet.

Algo-Logic’s KVS leverages Gateware Defined Networking® (GDN) on Field Programmable Gate Arrays (FPGAs) to perform lookups with the lowest latency (less than 1 microsecond), the highest throughput, and the least processing energy. Deploying GDN solutions saves network operators time, cost, and power, resulting in significantly lower Total Cost of Ownership (TCO). DOWNLOAD FULL REPORT

Implementing Ultra Low Latency Data Center Services with Programmable Logic

Data centers require many low-level network services to implement high-level applications. Key-Value Store (KVS) is a critical service that associates values with keys and allows machines to share these associations over a network. Most existing KVS systems run in software and scale out by running parallel processes on multiple microprocessor cores to increase throughput.

In this paper, we take an alternate approach by implementing an ultra-low-latency KVS in Field Programmable Gate Array (FPGA) logic. As with a software-based KVS, lookup transactions are sent over Ethernet to the machine that stores the value associated with that key. We find that the implementation in logic, however, scales up to provide much higher search throughput with much lower latency and power consumption than other implementations in software. DOWNLOAD FULL REPORT
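For contrast with the FPGA datapath described in the paper, a minimal software KVS of the kind it is compared against can be sketched as a hash table served over UDP. Everything in the sketch (the plain-text GET/SET wire format, the port number, the function names) is a hypothetical illustration, not Algo-Logic’s protocol.

```python
# Hedged sketch: a minimal software KVS server of the kind the FPGA design is
# compared against. The wire format (plain text "SET key value" / "GET key")
# and the port number are invented for illustration only.
import socket

def serve(host="0.0.0.0", port=9999):
    table = {}                                   # the key-value table
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    print(f"software KVS listening on {host}:{port}")
    while True:
        data, addr = sock.recvfrom(1500)         # one request per Ethernet frame
        parts = data.decode(errors="replace").strip().split(maxsplit=2)
        if len(parts) == 3 and parts[0].upper() == "SET":
            table[parts[1]] = parts[2]
            sock.sendto(b"OK", addr)
        elif len(parts) == 2 and parts[0].upper() == "GET":
            sock.sendto(table.get(parts[1], "NOT_FOUND").encode(), addr)
        else:
            sock.sendto(b"ERR", addr)

if __name__ == "__main__":
    serve()
```

Even on a loopback interface this path costs tens of microseconds per lookup once the kernel network stack and the interpreter are involved, which is precisely the gap a sub-microsecond in-fabric lookup is designed to close.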


International Supercomputing – ISC 2015 – New Product Release and FPGA Industry Overview…

Craig Petrie of Nallatech was at International Supercomputing Conference / ISC 2015, in Frankfurt, Germany, where he talked about the new Nallatech products, the 385A and 510T, as well as the state of the industry for FPGA development.

ISC 2015 - International Supercomputing Conference

View 385A
View 510T