
Supercomputing 2016 – Nallatech Exhibiting

Nallatech Drives FPGA Accelerator Revolution at Supercomputing 2016

Visit booth 2214 for Deep Learning CNN demonstration using latest generation of FPGA accelerators and tools

Nallatech FPGA Accelerator - Molex

LISLE, IL – November 14, 2016 – Nallatech, a Molex company, will showcase FPGA solutions for high-performance computing (HPC), network acceleration and data analytics at the Supercomputing 2016 (SC16) Conference and Exhibition, November 14-17 in Salt Lake City.

“Major datacenter customers are now deploying FPGAs to achieve break-through application performance and energy-efficiency above and beyond what can be achieved using conventional CPU and GPU platforms,” said Allan Cantle, President and Founder, Nallatech. “In-depth design, application expertise and process knowledge combined with extensive manufacturing capabilities allow for quick deployments of large-scale FPGA solutions.”

A demonstration at Nallatech booth 2214 will showcase FPGA acceleration of Convolutional Neural Network (CNN) object classification using a low profile Nallatech 385A™ PCIe accelerator card with a discrete Intel Arria 10 FPGA, programmed using Intel’s OpenCL Software Development Kit. Built on the BVLC Caffe deep learning framework, an FPGA interface and IP accelerate the processing-intensive components of the algorithm. Nallatech IP is capable of processing an image through the AlexNet neural network in nine milliseconds. The Arria 10-based 385A™ board has the capacity to process six CNN images in parallel, allowing classification of 660 images per second.

Additionally, Nallatech will display a range of leading-edge technologies at SC16:

  • 385A™ FPGA Accelerator Card — A low profile, server-qualified FPGA card capable of accelerating energy-efficient datacenter applications. Two independent banks of SDRAM memory and two optical network ports complete the balanced architecture capable of both co-processing and latency-critical 1G/10G/40G streaming applications.
  • 385A-SoC™ System on Chip FPGA Accelerator Card – A powerful computing and I/O platform for SoC FPGA and ARM-based development and deployment across a range of application areas including HPC, image processing and network analytics. The 385A-SoC™ is capable of being used in “stand-alone” mode running Linux and software stacks on the embedded ARM processors allowing the accelerator to be used without a host server for ultimate Size, Weight and Power (SWAP) performance.
  • 510T™ Compute Acceleration Card – FPGA co-processor designed to deliver ultimate performance per watt for compute-intensive datacenter applications. A GPU-sized 16-lane PCIe Gen 3 card, the 510T™ features two Intel Arria 10 FPGAs delivering up to sixteen times the performance of the previous generation. Applications can achieve a total sustained performance of up to 3 TFlops. The 510T™ card is available with almost 300GByte/sec of peak external memory bandwidth configured as eight independent banks of DDR4. This combination, plus the FPGA’s on-chip memory bandwidth of 14.4TBytes/sec, permits dramatic new levels of performance per watt for memory-bound applications.
  • Nallatech® FPGA Accelerated Compute Node® – Develop and deploy quickly with minimal risk using a Nallatech FPGA accelerator, tools and IP delivered pre-integrated in a server of your choice.

About Nallatech:
Nallatech, a Molex company, is a leading supplier of accelerated computing solutions. Nallatech has deployed several of the world’s largest FPGA hybrid compute clusters, and is focused on delivering scalable solutions that deliver high performance per watt, per dollar. www.nallatech.com.

About Molex, LLC
Molex brings together innovation and technology to deliver electronic solutions to customers worldwide. With a presence in more than 40 countries, Molex offers a full suite of solutions and services for many markets, including data communications, consumer electronics, industrial, automotive, commercial vehicle and medical. For more information, please visit http://www.molex.com.

Microsoft Bets its Future on a Reprogrammable Computer Chip

Prototype hardware for Project Catapult, a six-year effort to rebuild Microsoft’s online empire for the next wave of AI—and more. CLAYTON COTTERELL FOR WIRED

IT WAS DECEMBER 2012, and Doug Burger was standing in front of Steve Ballmer, trying to predict the future.

Ballmer, the big, bald, boisterous CEO of Microsoft, sat in the lecture room on the ground floor of Building 99, home base for the company’s blue-sky R&D lab just outside Seattle. The tables curved around the outside of the room in a U-shape, and Ballmer was surrounded by his top lieutenants, his laptop open. Burger, a computer chip researcher who had joined the company four years earlier, was pitching a new idea to the execs. He called it Project Catapult.


Doug Burger


The tech world, Burger explained, was moving into a new orbit. In the future, a few giant Internet companies would operate a few giant Internet services so complex and so different from what came before that these companies would have to build a whole new architecture to run them. They would create not just the software driving these services, but the hardware, including servers and networking gear. Project Catapult would equip all of Microsoft’s servers—millions of them—with specialized chips that the company could reprogram for particular tasks.

But before Burger could even get to the part about the chips, Ballmer looked up from his laptop. When he visited Microsoft Research, Ballmer said, he expected updates on R&D, not a strategy briefing. “He just started grilling me,” Burger says. Microsoft had spent 40 years building PC software like Windows, Word, and Excel. It was only just finding its feet on the Internet. And it certainly didn’t have the tools and the engineers needed to program computer chips—a task that’s difficult, time consuming, expensive, and kind of weird. Microsoft programming computer chips was like Coca Cola making shark fin soup.

Project Catapult

The current incarnation of Project Catapult. CLAYTON COTTERELL FOR WIRED

Burger—trim, only slightly bald, and calmly analytical, like so many good engineers—pushed back. He told Ballmer that companies like Google and Amazon were already moving in this direction. He said the world’s hardware makers wouldn’t provide what Microsoft needed to run its online services. He said that Microsoft would fall behind if it didn’t build its own hardware. Ballmer wasn’t buying it. But after a while, another voice joined the discussion. This was Qi Lu, who runs Bing, Microsoft’s search engine. Lu’s team had been talking to Burger about reprogrammable computer chips for almost two years. Project Catapult was more than possible, Lu said: His team had already started.

Today, the programmable computer chips that Burger and Lu believed would transform the world—called field programmable gate arrays—are here. FPGAs already underpin Bing, and in the coming weeks, they will drive new search algorithms based on deep neural networks—artificial intelligence modeled on the structure of the human brain—executing this AI several orders of magnitude faster than ordinary chips could. As in, 23 milliseconds instead of four seconds of nothing on your screen. FPGAs also drive Azure, the company’s cloud computing service. And in the coming years, almost every new Microsoft server will include an FPGA. That’s millions of machines across the globe. “This gives us massive capacity and enormous flexibility, and the economics work,” Burger says. “This is now Microsoft’s standard, worldwide architecture.”


Catapult team members Adrian Caulfield, Eric Chung, Doug Burger, and Andrew Putnam. CLAYTON COTTERELL FOR WIRED

This isn’t just Bing playing catch-up with Google. Project Catapult signals a change in how global systems will operate in the future. From Amazon in the US to Baidu in China, all the Internet giants are supplementing their standard server chips—central processing units, or CPUs—with alternative silicon that can keep pace with the rapid changes in AI. Microsoft now spends between $5 and $6 billion a year for the hardware needed to run its online empire. So this kind of work is “no longer just research,” says Satya Nadella, who took over as Microsoft’s CEO in 2014. “It’s an essential priority.” That’s what Burger was trying to explain in Building 99. And it’s what drove him and his team to overcome years of setbacks, redesigns, and institutional entropy to deliver a new kind of global supercomputer.

A Brand New, Very Old Kind of Computer Chip
In December of 2010, Microsoft researcher Andrew Putnam had left Seattle for the holidays and returned home to Colorado Springs. Two days before Christmas, he still hadn’t started shopping. As he drove to the mall, his phone rang. It was Burger, his boss. Burger was going to meet with Bing execs right after the holiday, and he needed a design for hardware that could run Bing’s machine learning algorithms on FPGAs.

Putnam pulled into the nearest Starbucks and drew up the plans. It took him about five hours, and he still had time for shopping.

Burger, 47, and Putnam, 39, are both former academics. Burger spent nine years as a professor of computer science at the University of Texas, Austin, where he specialized in microprocessors and designed a new kind of chip called EDGE. Putnam had worked for five years as a researcher at the University of Washington, where he experimented with FPGAs, programmable chips that had been around for decades but were mostly used as a way of prototyping other processors. Burger brought Putnam to Microsoft in 2009, where they started exploring the idea that these chips could actually accelerate online services.


Project Catapult Version 1, or V1, the hardware that Doug Burger and team tested in a data center on Microsoft’s Seattle campus. CLAYTON COTTERELL FOR WIRED

Even their boss didn’t buy it. “Every two years, FPGAs are ‘finally going to arrive,’” says Microsoft Research vice president Peter Lee, who oversees Burger’s group. “So, like any reasonable person, I kind of rolled my eyes when this was pitched.” But Burger and his team believed this old idea’s time had come, and Bing was the perfect test case.

Microsoft’s search engine is a single online service that runs across thousands of machines. Each machine is driven by a CPU, and though companies like Intel continue to improve them, these chips aren’t keeping pace with advances in software, in large part because of the new wave in artificial intelligence. Services like Bing have outstripped Moore’s Law, the canonical notion that the number of transistors in a processor doubles every 18 months. Turns out, you can’t just throw more CPUs at the problem.

But on the other hand, it’s generally too expensive to create specialized, purpose-built chips for every new problem. FPGAs bridge the gap. They let engineers build chips that are faster and less energy-hungry than an assembly-line, general-purpose CPU, but customizable so they handle the new problems of ever-shifting technologies and business models.

At that post-holiday meeting, Burger pitched Bing’s execs on FPGAs as a low-power way of accelerating searches. The execs were noncommittal. So over the next several months, Burger and team took Putnam’s Christmas sketch and built a prototype, showing that it could run Bing’s machine learning algorithms about 100 times faster. “That’s when they really got interested,” says Jim Larus, another member of the team back then who’s now a dean at Switzerland’s École Polytechnique Fédérale in Lausanne. “They also started giving us a really hard time.”

The prototype was a dedicated box with six FPGAs, shared by a rack full of servers. If the box went on the fritz, or if the machines needed more than six FPGAs—increasingly likely given the complexity of the machine learning models—all those machines were out of luck. Bing’s engineers hated it. “They were right,” Larus says.

So Burger’s team spent many more months building a second prototype. This one was a circuit board that plugged into each server and included only one FPGA. But it also connected to all the other FPGA boards on all the other servers, creating a giant pool of programmable chips that any Bing machine could tap into.

That was the prototype that got Qi Lu on board. He gave Burger the money to build and test over 1,600 servers equipped with FPGAs. The team spent six months building the hardware with help from manufacturers in China and Taiwan, and they installed the first rack in an experimental data center on the Microsoft campus. Then, one night, the fire suppression system went off by accident. They spent three days getting the rack back in shape—but it still worked.

Over several months in 2013 and 2014, the test showed that Bing’s “decision tree” machine-learning algorithms ran about 40 times faster with the new chips. By the summer of 2014, Microsoft was publicly saying it would soon move this hardware into its live Bing data centers. And then the company put the brakes on.

Searching for More Than Bing
Bing dominated Microsoft’s online ambitions in the early part of the decade, but by 2015 the company had two other massive online services: the business productivity suite Office 365 and the cloud computing service Microsoft Azure. And like all of their competitors, Microsoft executives realized that the only efficient way of running a growing online empire is to run all services on the same foundation. If Project Catapult was going to transform Microsoft, it couldn’t be exclusive to Bing. It had to work inside Azure and Office 365, too.

The problem was, Azure executives didn’t care about accelerating machine learning. They needed help with networking. The traffic bouncing around Azure’s data centers was growing so fast, the service’s CPUs couldn’t keep pace. Eventually, people like Mark Russinovich, the chief architect on Azure, saw that Catapult could help with this too—but not the way it was designed for Bing. His team needed programmable chips right where each server connected to the primary network, so they could process all that traffic before it even got to the server.


The first prototype of the FPGA architecture was a single box shared by a rack of servers (Version 0). Then the team switched to giving individual servers their own FPGAs (Version 1). And then they put the chips between the servers and the overall network (Version 2).

So the FPGA gang had to rebuild the hardware again. With this third prototype, the chips would sit at the edge of each server, plugging directly into the network, while still creating a pool of FPGAs that was available for any machine to tap into. That started to look like something that would work for Office 365, too. Project Catapult was ready to go live at last.

Larus describes the many redesigns as an extended nightmare—not because they had to build new hardware, but because they had to reprogram the FPGAs every time. “That is just horrible, much worse than programming software,” he says. “Much more difficult to write. Much more difficult to get correct.” It’s finicky work, like trying to change tiny logic gates on the chip.

Now that the final hardware is in place, Microsoft faces that same challenge every time it reprograms these chips. “It’s a very different way of seeing the world, of thinking about the world,” Larus says. But the Catapult hardware costs less than 30 percent of everything else in the server, consumes less than 10 percent of the power, and processes data twice as fast as the company could without it.

The rollout is massive. Microsoft Azure uses these programmable chips to route data. On Bing, which handles an estimated 20 percent of the worldwide search market on desktop machines and about 6 percent on mobile phones, the chips are facilitating the move to the new breed of AI: deep neural nets. And according to one Microsoft employee, Office 365 is moving toward using FPGAs for encryption and compression as well as machine learning—for all of its 23.1 million users. Eventually, Burger says, these chips will power all Microsoft services.

Wait—This Actually Works?
“It still stuns me,” says Peter Lee, “that we got the company to do this.” Lee oversees an organization inside Microsoft Research called NExT, short for New Experiences and Technologies. After taking over as CEO, Nadella personally pushed for the creation of this new organization, and it represents a significant shift from the 10-year reign of Ballmer. It aims to foster research that can see the light of day sooner rather than later—that can change the course of Microsoft now rather than years from now. Project Catapult is a prime example. And it is part of a much larger change across the industry. “The leaps ahead,” Burger says, “are coming from non-CPU technologies.”

Peter Lee


All the Internet giants, including Microsoft, now supplement their CPUs with graphics processing units, chips designed to render images for games and other highly visual applications. When these companies train their neural networks to, for example, recognize faces in photos—feeding in millions and millions of pictures—GPUs handle much of the calculation. Some giants like Microsoft are also using alternative silicon to execute their neural networks after training. And even though it’s crazily expensive to custom-build chips, Google has gone so far as to design its own processor for executing neural nets, the tensor processing unit.

With its TPUs, Google sacrifices long-term flexibility for speed. It wants to, say, eliminate any delay when recognizing commands spoken into smartphones. The trouble is that if its neural networking models change, Google must build a new chip. But with FPGAs, Microsoft is playing a longer game. Though an FPGA isn’t as fast as Google’s custom build, Microsoft can reprogram the silicon as needs change. The company can reprogram not only for new AI models, but for just about any task. And if one of those designs seems likely to be useful for years to come, Microsoft can always take the FPGA programming and build a dedicated chip.





A newer version of the final hardware, V2, a card that slots into the end of each Microsoft server and connects directly to the network. CLAYTON COTTERELL FOR WIRED

Microsoft’s services are so large, and they use so many FPGAs, that they’re shifting the worldwide chip market. The FPGAs come from a company called Altera, and Intel executive vice president Diane Bryant tells me that Microsoft is why Intel acquired Altera last summer—a deal worth $16.7 billion, the largest acquisition in the history of the largest chipmaker on Earth. By 2020, she says, a third of all servers inside all the major cloud computing companies will include FPGAs.

It’s a typical tangle of tech acronyms. CPUs. GPUs. TPUs. FPGAs. But it’s the subtext that matters. With cloud computing, companies like Microsoft and Google and Amazon are driving so much of the world’s technology that those alternative chips will drive the wider universe of apps and online services. Lee says that Project Catapult will allow Microsoft to continue expanding the powers of its global supercomputer until the year 2030. After that, he says, the company can move toward quantum computing.

Later, when we talk on the phone, Nadella tells me much the same thing. They’re reading from the same Microsoft script, touting a quantum-enabled future of ultrafast computers. Considering how hard it is to build a quantum machine, this seems like a pipe dream. But just a few years ago, so did Project Catapult.

Correction: This story originally implied that the Hololens headset was part of Microsoft’s NExT organization. It was not.


FPGA Acceleration of Convolutional Neural Networks

Nallatech Whitepaper – FPGA Accelerated CNN

Introduction – Convolutional Neural Networks (CNN)

Convolutional Neural Networks (CNNs) have been shown to be extremely effective at complex image recognition problems. This white paper discusses how CNN computation can be accelerated using FPGA acceleration products from Nallatech, programmed using the Altera OpenCL Software Development Kit. Image categorization throughput can be optimized by adjusting the computation precision: reducing the computational precision allows the FPGA accelerator to process more images per second.

Caffe Integration

Caffe is a deep learning framework made with expression, speed, and modularity in mind. It was developed and is maintained by the Berkeley Vision and Learning Center and by community contributors. http://caffe.berkeleyvision.org/

The Caffe framework uses a protobuf text (prototxt) network description to define the processing layers required for a particular CNN. By implementing different combinations of layers, a user is able to quickly create a new network topology for a given set of requirements.

The most commonly used of these layers are:

  • Convolution: the convolution layer convolves the input image with a set of learnable filters, each producing one feature map in the output.
  • Pooling: max-pooling partitions the input image into a set of non-overlapping rectangles and, for each such sub-region, outputs the maximum.
  • Rectified-Linear (ReLU): given an input value x, the ReLU layer computes the output as x if x > 0 and negative_slope * x if x <= 0.
  • InnerProduct/Fully Connected: the image is treated as a single vector, with each input point contributing to each point of the new output vector.

By porting these 4 layers to the FPGA, the vast majority of forward processing networks can be implemented on the FPGA using the Caffe framework.
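To make the layer definitions above concrete, here is a minimal NumPy sketch of the four operations in plain software. It only illustrates the math; it is not the FPGA/OpenCL implementation described in this paper, and the array shapes and simplified stride/padding handling are our assumptions.

```python
import numpy as np

def convolution(image, filters):
    """Naive valid convolution: image (C, H, W), filters (K, C, FH, FW) -> (K, H-FH+1, W-FW+1)."""
    C, H, W = image.shape
    K, _, FH, FW = filters.shape
    out = np.zeros((K, H - FH + 1, W - FW + 1))
    for k in range(K):
        for y in range(H - FH + 1):
            for x in range(W - FW + 1):
                out[k, y, x] = np.sum(image[:, y:y + FH, x:x + FW] * filters[k])
    return out

def max_pool(feature, size=2):
    """Max-pooling over non-overlapping size x size windows."""
    C, H, W = feature.shape
    cropped = feature[:, :H - H % size, :W - W % size]
    return cropped.reshape(C, H // size, size, W // size, size).max(axis=(2, 4))

def relu(x, negative_slope=0.0):
    """ReLU as defined above: x if x > 0, otherwise negative_slope * x."""
    return np.where(x > 0, x, negative_slope * x)

def fully_connected(x, weights, bias):
    """InnerProduct: flatten the input and apply one weight per (input, output) pair."""
    return weights @ x.reshape(-1) + bias
```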

Figure 1: Example illustration of a typical CNN (Convolutional Neural Network)

To access the accelerated FPGA version of the code, the user need only change the description of the CNN layer in the Caffe prototxt network description to target the FPGA equivalent.


Figure 2: ImageNet CNN (Convolutional Neural Network)

ImageNet is a respected and widely used CNN, with freely available trained models and benchmarks. This paper discusses an FPGA implementation targeted at the ImageNet CNN; however, the approach used here applies equally well to other networks.

Figure 2 illustrates the different network layers required by the ImageNet CNN. There are 5 convolution and 3 fully connected layers. These layers occupy > 99% of the processing time for this network. There are 3 different filter sizes across the convolution layers: 11×11, 5×5 and 3×3. Because the computational time of each layer differs depending upon the number of filters applied and the size of the input images, creating separate hardware optimized for each convolution layer would be inefficient. To avoid this inefficiency, the smallest filter (3×3) is used as the basis for the larger convolutional blocks. The 3×3 convolutional layers are the most computationally demanding due to the number of input and output features they process.

The larger filter sizes can be represented as multiple passes of the smaller 3×3 filters. This adds inefficiency into the kernel processing, but allows for logic reuse between the different layers.  The cost of this approach is illustrated in Table 1.
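The cost of reusing 3×3 kernels for larger filters can be estimated directly: an N×N filter covered by ⌈N/3⌉² passes of a 3×3 kernel performs 9·⌈N/3⌉² multiplies for only N² useful taps. The sketch below works through that arithmetic; it is our illustration of the idea, and the exact figures in Table 1 may differ depending on how the layers are actually mapped.

```python
import math

def kernel_efficiency(filter_size, base=3):
    """Fraction of 3x3-kernel multiplies that contribute useful filter taps."""
    passes = math.ceil(filter_size / base) ** 2   # 3x3 passes needed to cover the filter
    useful = filter_size ** 2                     # taps actually required by the filter
    return useful / (passes * base * base)

for n in (11, 5, 3):   # the three ImageNet convolution filter sizes
    print(f"{n}x{n} filter: {kernel_efficiency(n):.0%} of kernel multiplies are useful")
# 11x11 -> ~84%, 5x5 -> ~69%, 3x3 -> 100%
```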

Table 1: Convolution kernel efficiency

The 3×3 convolution kernel can also be used by the fully connected layers.

Table 2: ImageNet layer computation requirements when using 3×3 filters

FPGA logic areas

FPGA devices have two processing resource types: DSP blocks and ALU logic. The DSP blocks are dedicated logic optimized for large (18×18 bit) multiply or multiply-add operations, which are costly to build from ALU logic. Given how common multiplication is in signal processing workloads, FPGA vendors provide dedicated logic for this purpose, and Altera has gone a step further by allowing the DSP logic to be reconfigured to perform floating point operations. To increase performance for CNN processing it is necessary to increase the number of multiplications that can be implemented in the FPGA. One approach is to decrease the bit accuracy.

Bit Accuracy

Most CNN implementations use floating point precision for the different layer calculations. For a CPU or GPGPU implementation this is not an issue, as the floating point units are a fixed part of the chip architecture. For FPGAs the logic elements are not fixed. The Arria 10 devices from Altera have embedded floating-point DSP blocks that can also be used for fixed point multiplication. Each DSP block can in fact be used for two independent 18×19 bit multiplications. By performing convolution using 18 bit fixed point logic, the number of available multipliers doubles compared to single precision floating point.

Figure 4: Arria 10 fixed point DSP configuration

Depending upon the performance requirements of the CNN application, the bit precision can be reduced further still. If the bit width of the multiplications can be reduced to 10 bits or less (20 bit output), the multiplication can be performed efficiently using just the FPGA ALU logic. This doubles the number of multiplications possible compared to using the FPGA DSP logic alone.

OpenCL library functions

Altera has provided the ability to include user defined and optimized IP components into their compiler tool flow. This allows such optimized functions to be created and included using standard library notation. The library components allow an experienced HDL programmer to create highly efficient implementations in the same way an assembly language programmer would create and include optimized functions for x86.

For the CNN layers used by ImageNet it was ascertained that 10 bit coefficient data was the smallest width that could be used in a simple fixed point implementation whilst maintaining less than 1% error versus a single precision floating point implementation. An optimized library for a 10 bit 3×3 convolution was therefore created. This library component was then replicated as many times as possible, limited by the FPGA resources available.
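As an illustration of the kind of check involved, the sketch below quantizes coefficients to a 10 bit signed fixed point format and measures the per-coefficient error. The choice of one sign bit plus nine fractional bits for coefficients in [-1, 1) is our assumption rather than the format of the Nallatech library, and the paper's 1% figure refers to the end-to-end result rather than individual coefficients, so this is only a first-order sanity test.

```python
import numpy as np

def quantize_10bit(weights, frac_bits=9):
    """Round to 10-bit signed fixed point (1 sign bit + 9 fractional bits), return as float."""
    scale = 1 << frac_bits
    q = np.clip(np.round(weights * scale), -512, 511)   # 10-bit two's-complement range
    return q / scale

rng = np.random.default_rng(0)
w = rng.uniform(-1.0, 1.0, size=10_000).astype(np.float32)   # stand-in coefficient data
w_q = quantize_10bit(w)
rel_err = np.abs(w_q - w).mean() / np.abs(w).mean()
print(f"mean relative coefficient error: {rel_err:.3%}")      # well under 1% here
```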

Figure 5: Arria 10 GX1150 resources

The largest currently available Arria 10 device is the GX 1150. This device has the resources for ~512 of these convolution blocks, plus the application control logic.


Increasing the number of parallel convolution kernels increases the input bandwidth requirements. To avoid global memory becoming a bottleneck, multiple images are processed at once, allowing the convolution filter weights to be reused across the different images. This is particularly important for the fully connected layers, where a new set of filter weights is required for each point-to-point connection and the speed at which weights are retrieved from global memory is the bottleneck. Fortunately the convolution layers reuse the weight data for each point in a feature image. The smallest convolution feature image is 13×13 pixels, so the convolution weights need only be updated every 169 iterations in the worst case.
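A rough way to quantify that reuse: every weight fetched from global memory for a 13×13 feature map is applied 169 times before it must be replaced, so the weight-fetch bandwidth needed per multiply-accumulate drops by the same factor, whereas a fully connected layer needs a fresh weight for every connection. A small illustrative calculation follows; the 10 bit coefficient width and 13×13 feature size come from the text, everything else is just for illustration.

```python
def weight_bits_per_mac(coeff_bits, reuse):
    """Bits of weight data fetched from global memory per multiply-accumulate."""
    return coeff_bits / reuse

conv_reuse = 13 * 13     # smallest convolution feature map: each weight reused 169 times
fc_reuse = 1             # fully connected layer: a fresh weight for every connection

print(weight_bits_per_mac(10, conv_reuse))   # ~0.06 bits per MAC for convolution
print(weight_bits_per_mac(10, fc_reuse))     # 10 bits per MAC for fully connected layers
```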

Figure 6: Nallatech 510T Accelerator

The hardware selected for this CNN implementation was the Nallatech 510T – a GPU-sized FPGA accelerator card compatible with most server platforms designed to support Intel Xeon Phi or GPGPU accelerators. The Nallatech 510T features two Altera Arria 10 GX 1150 FPGAs with ~60 GBytes/sec of external memory bandwidth for loading weights, input and output data. Typical power consumption of the 510T is only 150W – less than half the power consumption of a high-end GPGPU. An added bonus of using 10 bit coefficient data is that roughly three times as much weight data can be read from global memory per second compared to single precision floating point data.

Using the Nallatech 510T accelerator, 16 parallel images can be processed with each image having 64 kernels processed in parallel. This was achieved by generating 8 output features and 8 pixels per feature in parallel. This gives a total of 1024 parallel 3×3 kernels.

In our implementation we created an OpenCL kernel system for 1 image and replicated this as many times as possible given the FPGA resource constraints. The convolution weights are reused for each image so there is minimal increase to global memory requirements when scaling to multiple parallel images.


By applying the above FPGA system, each image takes 9 milliseconds to be categorized by the FPGA. With 12 parallel images handled by the 510T this gives an average time of 748 microseconds per image, or over 115 million images per day.
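The throughput figures follow directly from the per-image latency: with 12 images in flight, a new classification completes roughly every 9 ms / 12 ≈ 750 µs, which works out to over 115 million images per day. A quick check of that arithmetic:

```python
latency_s = 9e-3                 # time for the FPGA to categorize one image
images_in_flight = 12            # parallel images quoted for the 510T in this paragraph

avg_time_per_image = latency_s / images_in_flight         # ~750 microseconds per image
images_per_second = images_in_flight / latency_s           # ~1,333 images per second
images_per_day = images_per_second * 24 * 3600             # ~115 million images per day
print(avg_time_per_image, images_per_second, images_per_day)
```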

Figure 7: Images categorized per second, Nallatech 510T versus Nvidia K40 [1]


The Nvidia K40 GPGPU has a nominal power consumption of 235 Watts compared to the 150W of the Nallatech 510T. This gives the FPGA implementation a significant performance/power advantage over the GPGPU.

Figure 8: Relative power per image, Nallatech 510T versus Nvidia K40 [1]

[1] Caffe implementation of ImageNet: http://caffe.berkeleyvision.org


The unique flexibility of FPGA fabric allows the logic precision to be adjusted to the minimum that a particular network design requires. By limiting the bit precision of the CNN calculation, the number of images that can be processed per second can be significantly increased, improving performance and reducing power.

The non-batching approach of an FPGA implementation allows for object recognition in 9 milliseconds (a single frame period), which is ideal for situations where low latency is crucial, e.g. object avoidance. This permits images to be categorized at a frame rate greater than 100 Hz.

The intrinsic scalability demonstrated by our FPGA implementation can be used to implement complex CNNs on increasingly smaller and lower power FPGAs at the expense of some performance. This allows less demanding applications to be implemented on extremely low power FPGA devices, which is particularly useful for embedded solutions, e.g. near-sensor computing.

Figure 9: Miniaturized packaging module for near sensor processing (FPGA, memory and support circuitry)










By packaging FPGAs with sensor hardware it is possible to apply the power of CNN image recognition near the sensor, ensuring low latency is maintained and optimizing bandwidth between the sensor and the host.

Molex Acquires Interconnect Systems, Inc

Acquisition strengthens Molex advanced high performance computing solution offering

LISLE, IL – April 7, 2016 – Molex, a leading global manufacturer of electronic solutions, announced today the acquisition of Interconnect Systems, Inc. (“ISI”), which specializes in the design and manufacture of high density silicon packaging with advanced interconnect technologies.

According to Tim Ruff, senior vice president, Molex, the acquisition enables Molex to offer a wider range of fully integrated solutions to customers worldwide. “We are excited about the unique capabilities and technologies the ISI team brings to Molex. ISI’s proven expertise in high-density chip packaging strengthens our platform for growth in existing markets and opens doors to new opportunities.”

Headquartered in Camarillo, California, ISI delivers advanced packaging and interconnect solutions to top-tier OEMs in a wide range of industries and technology markets, including aerospace & defense, industrial, data storage and networking, telecom, and high performance computing. ISI uses a multi-discipline customized approach to improve solution performance, reduce package size, and expedite time-to-market for customers.

“We are thrilled to join forces with Molex. By combining respective strengths and leveraging their global manufacturing footprint, we can more efficiently and effectively provide customers with advanced technology platforms and top-notch support services, while scaling up to higher volume production,” said Bill Miller, president, ISI.

About Molex, LLC
Molex brings together innovation and technology to deliver electronic solutions to customers worldwide. With a presence in more than 40 countries, Molex offers a full suite of solutions and services for many markets, including data communications, consumer electronics, industrial, automotive, commercial vehicle and medical. For more information, please visit http://www.molex.com.

Molex Resources:

Learn more about Molex at http://www.molex.com
Follow us at www.twitter.com/molexconnectors
Watch our videos at www.youtube.com/molexconnectors
Connect with us at www.facebook.com/molexconnectors
Read our blog at www.connector.com

How FPGA Boards Can Take On GPUs And Knights Landing

March 17, 2016 Timothy Prickett Morgan

Nallatech 510T FPGA board

Nallatech doesn’t make FPGAs, but it does have several decades of experience turning FPGAs into accelerator boards, including devices and systems that companies can deploy to solve real-world computing problems without having to do the systems integration work themselves.

With the formerly independent Altera, now part of Intel, shipping its Arria 10 FPGAs, Nallatech has engineered a new coprocessor FPGA board that will allow FPGAs to keep pace with current and future Tesla GPU accelerators from Nvidia and “Knights Landing” Xeon Phi processors and coprocessors from Intel. The architectures of the devices share some similarities, and that is no accident because all HPC applications are looking to increase memory bandwidth and find the right mix of compute, memory capacity, and memory bandwidth to provide efficient performance on parallel applications.

Like the Knights Landing Xeon Phi, the new 510T FPGA board uses a mix of standard DDR4 and Hybrid Memory Cube (HMC) memory to provide a mix of high bandwidth, low capacity memory with high capacity, relatively low bandwidth memory to give an overall performance profile that is better than a mix of FPGAs and plain DDR4 together on the same card. (We detailed the Knights Landing architecture a year ago and updated the specs on the chip last fall. The specs on the future “Pascal” Tesla GPU accelerators, such as we know them, are here.)

In the case of the 510T FPGA board from Nallatech, the compute element is a pair of Altera Arria 10 GX 1150 FPGAs, which are etched in 20 nanometer processes from foundry partner Taiwan Semiconductor Manufacturing Corp. The higher-end Stratix 10 FPGAs are made using Intel’s 14 nanometer processes and pack a lot more punch with up to 10 teraflops per device, but they are not available yet. Nallatech is creating FPGA board coprocessors that will use these future FPGAs. But for a lot of workloads, as Nallatech president and founder Allan Cantle explains to The Next Platform, the compute is not as much of an issue as memory bandwidth to feed that compute. Every workload is different, so that is no disrespect to the Stratix 10 devices, but rather a reflection of the key oil and gas customers that Nallatech engaged with to create the 510T FPGA board.

“In reality, these seismic migration algorithms need huge amounts of compute, but fundamentally, they are streaming algorithms and they are memory bound,” says Cantle. “When we looked at this for one of our customers, who was using Tesla K80 GPU accelerators, somewhere between 5 percent and 10 percent of the available floating point performance was actually being used and 100 percent of the memory bandwidth was consumed. That Tesla K80 with dual GPUs has 24 GB of memory and 480 GB/sec of aggregate memory bandwidth across those GPUs, and it has around 8.7 teraflops of peak single precision floating point capability. We have two Arria 10s, which are rated at 1.5 teraflops each, which is just around 3 teraflops total, but I think the practical upper limit is 2 teraflops, and that is just my personal take. But when you look at it, you only need 400 gigaflops to 800 gigaflops, making very efficient use of the FPGA’s available flops, which you cannot do on a GPU.”

The issue, says Cantle, is that the way the GPU implements the streaming algorithm at the heart of the seismic migration application that is used to find oil buried underground, it makes many accesses to the GDDR5 memory in the GPU card, which is what is burning up all of the memory bandwidth. “The GPU consumes its memory bandwidth quite quickly because you have to come off chip the way the math is done,” Cantle continues. “The opportunity with the FPGA board is to make this into a very deep pipeline and to minimize the amount of time you go into global memory.”

Nallatech 510T FPGA board

The trick that Nallatech is using is putting a block of HMC memory between the two FPGAs on the FPGA board, which is a fast, shared memory space that the two FPGAs can actually share and address at the same time. The 510T FPGA board is one of the first compute devices (rather than networking or storage devices) that is implementing HMC memory, which has been co-developed by Micron Technology and Intel, and it is using the second generation of HMC to be precise. (Nallatech did explore first generation HMC memory on FPGA accelerators for unspecified government customers, but this was not commercially available as the 510T FPGA board is.)

In addition to memory bandwidth bottlenecks, seismic applications used in the oil and gas industry also have memory capacity issues. The larger the memory that the compute has access to, the larger the volume (higher number of frequencies) that the seismic simulation can run. With the memory limit on a single GPU, says Cantle, this particular customer was limited to approximately 800 volumes (it is actually a cube). Oil and gas customers would love to be able to do 4K volumes (again cubed), but that would require about 2 TB of memory to do.

So the 510T FPGA board has four ports of DDR4 main memory to supply capacity to store more data to do the larger and more complex seismic analysis, and by ganging up 16 FPGA boards together across hybrid CPU-FPGA nodes, Nallatech can break through that 4K volumes barrier and reach the level of performance that oil and gas companies are looking for.

Here is the block diagram of the 510T FPGA board:

510T FPGA board diagram

The HMC memory comes in a 2 GB capacity, with 4 GB optional, and has separate read and write ports, each of which delivers 30 GB/sec of peak bandwidth per FPGA on the card. The four ports of DDR4 memory that link to the other side of the FPGAs deliver 32 GB of capacity per FPGA (with an option of 64 GB per FPGA) and 85 GB/sec of peak bandwidth. So each card has 290 GB/sec of aggregate bandwidth and up to 132 GB of memory for the applications to play in.
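One way to reconcile the per-FPGA figures with the card totals quoted above is to treat the 30 GB/sec read and write HMC ports and the 85 GB/sec of DDR4 as per-FPGA numbers, and to use the optional 4 GB HMC and 64 GB DDR4 configuration for capacity; this is our reading of the article rather than an official breakdown.

```python
fpgas_per_card = 2

hmc_bw_per_fpga = 30 + 30          # GB/s: separate read and write ports, 30 GB/s each
ddr4_bw_per_fpga = 85              # GB/s across the four DDR4 ports attached to each FPGA
aggregate_bw = fpgas_per_card * (hmc_bw_per_fpga + ddr4_bw_per_fpga)
print(aggregate_bw)                # 290 GB/s, matching the figure quoted in the article

hmc_capacity = 4                   # GB, using the optional larger HMC part
ddr4_capacity_per_fpga = 64        # GB, using the optional larger DDR4 configuration
print(hmc_capacity + fpgas_per_card * ddr4_capacity_per_fpga)   # 132 GB, as quoted
```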

These FPGA boards slide into a PCI-Express x16 slot, and in fact, Nallatech has worked with server maker Dell to put four of these FPGA boards and two Xeon E5 processors into a custom, high-end 1U rack-mounted server. The Nallatech 510T cards cost $13,000 each at list price, and the cost of a server with four of them, plus an OpenCL software development kit and the Altera Quartus Prime Pro FPGA design software, is $60,000.

Speaking very generally, the two-FPGA boards can deliver about 1.3X the performance of the Tesla K80 running the seismic codes at this oil and gas customer in about half the power envelope, says Cantle, and there is a potential upside of 10X performance for customers that have larger volume datasets or who are prepared to optimize their algorithms to leverage the strengths of the FPGA. But Nallatech also knows that FPGAs are more difficult to program than GPUs at this point, and it is being practical about the competitive positioning.

“At the end of the day, everyone needs to be a bit realistic here,” says Cantle. “In terms of price/performance, FPGA cards do not sell in the volumes of GPU boards, so we hit a price/performance limit for these types of algorithms. The idea here is to prove that today’s FPGAs are competent at what GPUs are great at. For oil and gas customers, it makes sense for companies to weigh this up. Is it a slam dunk? I can’t say that. But if you are doing bit manipulation problems – compression, encryption, bioinformatics – it is a no brainer that the FPGA is far better – tens of times faster – than the GPU. There will be places where the FPGA will be a slam dunk, and with Intel’s purchase of Altera, their future is certainly bright.”

The thing we observe is that companies will have to look not just at raw compute but at how their models can scale across the various memories in a compute element, across multiple elements lashed together inside a node, and across nodes.

Source: TheNextPlatform

Low Latency Key-Value Store / High Performance Data Center Services

Gateware Defined Networking® (GDN) – Search Low Latency Key-Value Store

Low Latency Key-Value Store (KVS) is an essential service for multiple applications. Telecom directories, Internet Protocol forwarding tables, and de-duplicating storage systems, for example, all need key-value tables to associate data with unique identifiers. In datacenters, high performance KVS tables allow hundreds or thousands of machines to easily share data by simply associating values with keys and allowing client machines to read and write those keys and values over standard high-speed Ethernet.

Algo-Logic’s KVS leverages Gateware Defined Networking® (GDN) on Field Programmable Gate Arrays (FPGAs) to perform lookups with the lowest latency (less than 1 microsecond), the highest throughput, and the least processing energy. Deploying GDN solutions saves network operators time, cost, and power, resulting in significantly lower Total Cost of Ownership (TCO). DOWNLOAD FULL REPORT

Implementing Ultra Low Latency Data Center Services with Programmable Logic

Data centers require many low-level network services to implement high-level applications. Key-Value Store (KVS) is a critical service that associates values with keys and allows machines to share these associations over a network. Most existing KVS systems run in software and scale out by running parallel processes on multiple microprocessor cores to increase throughput.

In this paper, we take an alternate approach by implementing an ultra-low-latency KVS in Field Programmable Gate Array (FPGA) logic. As with a software-based KVS, lookup transactions are sent over Ethernet to the machine that stores the value associated with that key. We find that the implementation in logic, however, scales up to provide much higher search throughput with much lower latency and power consumption than other implementations in software. DOWNLOAD FULL REPORT
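To make the model concrete, here is a minimal software sketch of the idea described above: a key-value lookup sent as a small packet over the network to the machine that stores the value. This is purely illustrative Python over UDP using a toy wire format of our own; Algo-Logic's FPGA implementation uses its own protocol and reaches sub-microsecond latencies that no software sketch approaches.

```python
import socket

STORE = {b"user:42": b"alice"}          # toy key-value table held on the server machine

def serve(host="0.0.0.0", port=5555):
    """Answer GET <key> / SET <key> <value> requests arriving over UDP."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    while True:
        msg, addr = sock.recvfrom(1500)
        parts = msg.split(b" ", 2)
        if parts[0] == b"GET" and len(parts) >= 2:
            sock.sendto(STORE.get(parts[1], b"NOT_FOUND"), addr)
        elif parts[0] == b"SET" and len(parts) == 3:
            STORE[parts[1]] = parts[2]
            sock.sendto(b"OK", addr)

def lookup(key, server=("127.0.0.1", 5555)):
    """Client side: one lookup transaction is a single request/response packet pair."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(b"GET " + key, server)
    value, _ = sock.recvfrom(1500)
    return value
```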


Nallatech 510T FPGA Datacenter Acceleration


Datacenter Acceleration Hardware:
Nallatech 510T FPGA Accelerator Disrupts the Datacenter

Maximum performance, minimum power FPGA accelerator with OpenCL tool flow

Camarillo, CA – Nallatech, a leading supplier of high-performance FPGA solutions, announces the introduction of the 510T™ - an FPGA co-processor designed to deliver ultimate performance per watt for compute-intensive datacenter applications.

The 510T is a GPU-sized 16-lane PCIe 3.0 card featuring two of Altera’s new floating-point enabled Arria 10 FPGAs delivering up to sixteen times the performance of the previous generation. Applications can achieve a total sustained performance of up to 3 TFlops.
Nallatech 510T Datacenter Co-Processor

Deliverables include an optimized Board Support Package compatible with the Altera Software Development Kit (SDK) for OpenCL. This allows the card to be programmed at a high level of abstraction by customers unfamiliar with hardware-based tool flows historically required for FPGAs.

View all Nallatech FPGA Cards

The 510T is available with an unprecedented 290GByte/sec of peak external memory bandwidth configured as eight independent banks of DDR4 plus an ultra-fast Hybrid Memory Cube (HMC). This combination, plus the FPGA’s on-chip memory bandwidth of 14.4TBytes/sec, permits dramatic new levels of performance per watt for memory-bound applications.

“Until now, FPGA accelerators have typically been deployed as network-attached add-on cards at the periphery of the datacenter tasked with real-time streaming functions such as compression, encryption and filtering,” said Allan Cantle, President and Founder of Nallatech. “The 510T pushes the FPGA into the heart of the datacenter as a pure co-processor, providing customers with an OpenCL-programmable accelerator in a GPU form factor, but using only a fraction of the power. This allows customers to increase performance while reducing OPEX.”

“We’re delighted to see Nallatech further the adoption of FPGA-based computing with the introduction of the Arria 10-based 510T,” said Mike Strickland, Director of Strategic Marketing at Altera. “The 510T accelerator card, used in conjunction with Altera’s OpenCL SDK, raises the bar to a new level of application performance and delivers an energy-efficient alternative to GPUs.”

510T cards will ship during Q3 2015. Customers can purchase cards individually or as integrated servers pre-loaded with tools including the Altera OpenCL SDK and Nallatech Board Support Packages. To learn more, please visit www.nallatech.com/510T or speak to one of our experts at upcoming events.

About Nallatech
Nallatech, a subsidiary of Molex, is a leading supplier of accelerated computing solutions. Nallatech has deployed several of the world’s largest FPGA hybrid compute clusters, and is focused on delivering scalable solutions that deliver high performance per watt, per dollar.

Press Contact:
Tom Robinson
+1 (805) 383-3458

Nallatech 385A Leads the FPGA Accelerator Revolution

Featuring the Industry’s first floating point-enabled FPGA with OpenCL tool flow

Camarillo, CA – Nallatech, a leading supplier of high-performance FPGA solutions, announces the introduction of the 385A™ – a production-ready, server-qualified FPGA card capable of accelerating a new generation of energy-efficient datacenter applications.

Nallatech 385A – FPGA Card – Accelerate your Network with OpenCL

The 385A™ is a half-height, half-length PCIe Gen 3 card featuring Altera’s new floating-point enabled Arria 10 FPGA family capable of delivering 1.5 Peak TFlops of floating point performance. Two independent banks of SDRAM memory and dual QSFP+ network ports complete the balanced architecture capable of both co-processing and latency-critical 1G/10G/40G streaming applications.

View all Nallatech FPGA Cards

The 385A™ is delivered with an optimized Board Support Package compatible with the Altera Software Development Kit (SDK) for OpenCL. This allows the card to be programmed at a high level of abstraction by customers unfamiliar with traditional FPGA hardware-based tool flows.

“Intel’s acquisition of Altera marks a dramatic new chapter in the high performance computing industry, with FPGAs destined to play a key role,” said Allan Cantle, president and founder of Nallatech. “FPGAs are no longer just a curiosity to be benchmarked. Major companies such as Microsoft and IBM are deploying high volumes of FPGAs in production systems where a step-change in both application performance and energy-efficiency is needed, but cannot be realized using only CPU plus GPU configurations. The 385A is the ideal product for customers wanting to cost-effectively add FPGA fabric to their data center.”

385A™ cards are shipping now! Customers can purchase cards individually or as integrated servers pre-loaded with tools including the Altera OpenCL SDK and Nallatech Board Support Packages. To learn more, please visit www.nallatech.com/385A or speak to one of our experts at upcoming events.


About Nallatech

Nallatech, a subsidiary of Molex, is a leading supplier of accelerated computing solutions. Nallatech has deployed several of the world’s largest FPGA hybrid compute clusters, and is focused on delivering scalable solutions that deliver high performance per watt, per dollar.


Press Contact:
Tom Robinson
+1 (805) 383-3458

Microsoft Supercharges Bing Search With Programmable Chips

DOUG BURGER CALLED it Project Catapult.

Burger works inside Microsoft Research–the group where the tech giant explores blue-sky ideas–and in November 2012, he pitched a radical new concept to Qi Lu, the man who oversees Microsoft’s Bing web search engine. He wanted to completely change the machines that make Bing run, arming them with a new kind of computer processor.

Microsoft Supercharges Bing Search

Like Google and every other web giant, Microsoft runs its web services atop thousands of computer servers packed into warehouse-sized data centers, and most of these machines are equipped with ordinary processors from Intel, the world’s largest chip maker. But when he sat down with Lu, Burger said he wanted millions of dollars to build rack after rack of computer servers that used what are called field-programmable gate arrays, or FPGAs, processors that Microsoft could modify specifically for use with its own software. He said that these chips–built by a company called Altera–could not only speed up Bing searches, but also change the way Microsoft runs all sorts of other online services.

Despite the cost, and the riskiness of the proposition, Lu liked the idea. In a first for Microsoft, he approved a 1,600-server pilot-system to test out Burger’s ideas, and now, he has given the green light to actually move these FPGAs into Microsoft’s live data centers. This is set to happen early next year. That means that a few months from now, when you do a Bing search, there’s a decent chance that it will be carried out by one of Burger’s servers.

The move is part of a larger effort to fix what is an increasingly worrisome problem for big web companies like Microsoft, Google, and Facebook. After decades of regular performance boosts, chips are no longer improving at the same rate they once were. As their web services continue to grow, these companies are looking for new ways of improving the speed and efficiency of their already massive operations. Facebook is exploring the use of low-power ARM processors. According to reports, Google is too. And now Microsoft is about to roll out FPGAs. “There are large challenges in scaling the performance of software now,” says Burger. “The question is: ‘What’s next?’ We took a bet on programmable hardware.”


FPGAs, like the Altera chips that Microsoft used in its pilot project, have been around for years. A decade ago, they were widely used by chip designers as a low-cost way to prototype their new products. But lately, they’ve crept into networking gear, complex computer rigs that run the bitcoin digital currency, and even some specialized systems used by Wall Street firms to do data analysis. They give hardware makers more freedom to customize their gear.

Using FPGAs, Microsoft engineers are building a kind of super-search machine network they call Catapult. It comprises 1,632 servers, each one with an Intel Xeon processor and a daughter card that contains the Altera FPGA chip, linked to the Catapult network. The system takes search queries coming from Bing and offloads a lot of the work to the FPGAs, which are custom-programmed for the heavy computational work needed to figure out which webpage results should be displayed in which order. Because Microsoft’s search algorithms require such a mammoth amount of processing, Catapult can bundle the FPGAs into mini-networks of eight chips.

Microsoft Supercharges Bing Search

The FPGAs are 40 times faster than a CPU at processing Bing’s custom algorithms, Burger says. That doesn’t mean Bing will be 40 times faster–some of the work is still done by those Xeon CPUs–but Microsoft believes the overall system will be twice as fast as Bing’s existing system. Ultimately, this means Microsoft can operate a much greener data center. “Right off the bat we can chop the number of servers that we use in half,” Burger says.

What’s more, Microsoft can update the chips in much the same way it updates Bing’s system software, and Burger and his team can modify the logic on their processors to address bugs and changes in the Bing search algorithm. They do this by building a binary file that represents the updated chip logic and distributing it through Microsoft’s standard server management software, called Autopilot. It’s not uncommon to have several chip updates per week, Burger says.

Of course, there have been challenges. There was a lab flood and a fire with one of their Taiwanese parts suppliers, and as it stands, Microsoft’s server monitoring tools didn’t always know what to make of chips that suddenly dropped offline and restarted with reconfigured logic. But Microsoft is confident that the new FPGAs can be used across the company’s online empire. “If all we were doing was improving Bing, I probably wouldn’t get clearance from my boss to spend this kind of money on a project like this,” says Peter Lee, the head of Microsoft Research. “The Catapult architecture is really much more general-purpose, and the kinds of workloads that Doug is envisioning that can be dramatically accelerated by this are much more wide-ranging.”

It’s also the kind of work that’s likely to be emulated at other big web companies who have the resources to hire hardware developers, says James Larus, dean of the School of Computer and Communication Sciences at the École Polytechnique Fédérale de Lausanne. He previously worked at Microsoft on Project Catapult. “The benefits of hardware specialization are far too large for the right application for these companies to pass up the opportunity,” he says.

According to Burger, developing a whole new chip architecture for one of the world’s largest data center operators is the kind of thing that Microsoft Research does pretty well. “Let’s jump way out, think of something a little crazy, and then push on it and see how well that works,” he says. Come 2015, you can get the answer to that question simply by searching Bing.

