
Any Rate, Any Format: Accelerating Kafka Producers with FPGAs

Nallatech Whitepaper – Accelerating Kafka Producers with FPGAs

Introduction – Accelerate Kafka Producers with FPGAs

Apache Kafka is at the heart of the emerging universal streaming data pipeline. Kafka has seen many high-profile adoptions as the streaming platform of choice, with users including LinkedIn, Netflix, Uber and ING. At LinkedIn, approximately two trillion messages per day pass through Kafka. According to TechRepublic.com, six of the top 10 travel companies, seven of the top 10 global banks, eight of the top 10 insurance companies and nine of the top 10 telecom companies have adopted Kafka as the central platform for managing streaming data. At the 2017 New York Kafka Summit, Confluent reported that over one third of the Fortune 500 have deployed Kafka.

Basic Kafka System

Kafka has three essential components – producers, brokers and consumers. Producers publish data to topics on brokers and consumers subscribe to topics. Figure 1 shows a basic Kafka system.

Figure 1 – Basic Kafka System

One of the many advantages of the Kafka architecture is the decoupling of producers and consumers. Producers and consumers can run at wildly different data rates and yet have no effect on each other. The other key advantage of Kafka is its small size: with just over 90,000 lines of code, Kafka clusters can be implemented on much more modest hardware than Spark Streaming, which requires a full Spark node.
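For reference, this producer/consumer decoupling is visible in a few lines of client code. The sketch below is illustrative only: it assumes a broker at localhost:9092, the kafka-python package and a made-up topic name.

from kafka import KafkaProducer, KafkaConsumer

# Producer side: publish a record to a topic on the broker.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("ingest", b"sensor reading 42")
producer.flush()

# Consumer side: subscribe to the same topic and read at its own pace.
consumer = KafkaConsumer("ingest",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=1000)
for record in consumer:
    print(record.topic, record.value)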

Accelerating Kafka Producers

Data ingest into big data systems ranges from simple to complex. In Figure 2, data source 1 may be packet captures of network traffic, data source 2 could be complex geospatial images from a constellation of satellites, while data source 3 is industrial IoT maintenance data from a windmill farm in West Texas.

Figure 2 : Streaming Data Ingest Acceleration with Intel FPGAs

The variability in data formats and data rates makes the problem difficult to scale. Being able to adapt in real time to bursts in traffic and to new formats is often costly, requiring the provisioning of additional NICs and processors. Figure 3 shows a typical processor-based architecture used in most Kafka clusters.

Figure 3 : Typical Ingest Pathway

Data rate variability makes the system in Figure 3 difficult to plan. In many cases, the maximum bandwidth must be estimated and then provisioned for, leaving 50% or more of the processors and NICs idle while waiting for data rates to increase.

Moving to an Intel FPGA based solution, the same maximum bandwidth is still estimated, but the simplified system in Figure 4 consumes much less power while idle and requires a considerably smaller footprint overall. It also eliminates the flow control and load-balance management needed in a processor-based system, because the Intel FPGA based approach is deterministic regardless of data rate or data format.

Intel FPGAs are streaming, parallel accelerators that attach directly to copper and optical-fiber networks. Unlike traditional GPUs and CPUs, Intel FPGAs can move any data in any format from wire to memory in nanoseconds without the need for a Network Interface Card (NIC).

This acceleration of ingest can result in 40X lower latency into the Kafka producer. It also provides the option of simultaneous real-time processing of the inflowing data, such as machine learning, image recognition, pattern matching, filtering, compression and encryption. Ingested data can therefore be accelerated and enriched, speeding time to data acquisition and data analysis.

Use Case One:
Inline Extract & Transformation 

The most basic use case for FPGA ingest into a Kafka producer is shown in Figure 4. Even in this basic case, the FPGA provides low latency and determinism at extremely variable data rates. The ability to extract and transform the data with OpenCL allows this use case to handle tens to hundreds of data types.

Figure 4 Inline, Low Latency, Deterministic, Extraction & Transformation
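A CPU-side sketch of this extract-and-transform stage is shown below: fixed-format binary records are unpacked, reshaped into JSON and handed to a Kafka producer. The record layout, topic name and broker address are assumptions for illustration only; on the FPGA the equivalent transformation is expressed in OpenCL.

import json
import struct
from kafka import KafkaProducer

RECORD = struct.Struct("<IdH")          # assumed wire format: id, value, flags

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda obj: json.dumps(obj).encode("utf-8"),
)

def ingest(raw: bytes) -> None:
    # Extract each fixed-size record and transform it into a structured message.
    for (sensor_id, value, flags) in RECORD.iter_unpack(raw):
        producer.send("transformed-records",
                      {"id": sensor_id, "value": value, "flags": flags})

ingest(RECORD.pack(7, 3.14, 1) + RECORD.pack(8, 2.72, 0))
producer.flush()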

Use Case Two:
Inline Encryption & Decryption

Encryption is extremely expensive in processor cycles, but well understood on Intel FPGAs. FPGAs provide a low-latency, deterministic result with no dependency on the data rate. On processors, variable data rates can flood processor resources, causing a bottleneck and/or dropped packets.

Figure 5 Inline, Low Latency, Deterministic, Encryption or Decryption
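As a CPU-side reference for what this inline stage performs, the sketch below encrypts and decrypts a record with AES-GCM using the Python cryptography package. The key handling, nonce and payload are illustrative assumptions; the FPGA performs the equivalent transform in the data path ahead of the Kafka producer.

import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)   # 256-bit key, generated once
aes = AESGCM(key)

nonce = os.urandom(12)                      # fresh 96-bit nonce per message
ciphertext = aes.encrypt(nonce, b"raw ingest record", None)
plaintext = aes.decrypt(nonce, ciphertext, None)
assert plaintext == b"raw ingest record"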

Use Case Three:
Inline Compression & Decompression

FPGA’s are extremely efficient at compression and decompression. In this use case the FPGA is used to compress/decompress data before it is passed to the Kafka system.

Figure 6 Inline, Low Latency, Deterministic Compression or Decompression
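A CPU-side reference for the same inline step, using zlib from the Python standard library; the record contents are purely illustrative.

import zlib

record = b"timestamp=1510646400,sensor=windmill-17,vibration=0.0032" * 100
compressed = zlib.compress(record, 6)       # compress before handing off to the producer
restored = zlib.decompress(compressed)      # decompress, e.g. on the consumer side
assert restored == record
print(len(record), "->", len(compressed), "bytes")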

Use Case Four:
Information Theory with Encrypted/Decrypted &
Compressed/Decompressed Streams

Shannon entropy is being applied to more streaming use cases to determine whether a stream is encrypted. The entropy of each packet is calculated to distinguish random-looking bytes from structured bytes; encrypted payloads tend to look random, while most, though not all, structured data does not. Figure 7 shows a possible flow that calculates the entropy, attempts to decrypt, and then decompresses the data before publishing it to a Kafka topic. Even if the decryption and/or decompression cannot be completed successfully, sorting encrypted from unencrypted streams has many applications in industries that handle personally identifiable information, such as finance and health care.
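A minimal sketch of the entropy test, assuming byte-level Shannon entropy and an illustrative decision threshold (the whitepaper does not specify one): values close to 8 bits per byte suggest encrypted or already-compressed data, while structured text scores much lower.

import math
import os
from collections import Counter

def shannon_entropy(payload: bytes) -> float:
    # H = -sum(p * log2(p)) over the byte-value distribution of the payload
    counts = Counter(payload)
    total = len(payload)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def looks_encrypted(payload: bytes, threshold: float = 7.5) -> bool:
    return shannon_entropy(payload) >= threshold

print(looks_encrypted(b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n" * 20))  # False
print(looks_encrypted(os.urandom(1500)))                                           # True (almost always)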

Use Case Five:
Enriched Topic Routing

Figure 8 Enriched Topic Routing of PCAPs for Cyber Analytics

Kafka's flexible topic architecture allows ingested data to be placed into many topics. This flexibility means incoming data can be routed or switched using machine learning and pattern matching. Take Figure 8 above, which shows raw network packets being captured (PCAPs). As the packets are captured, complex pattern matching using PCRE expressions can route them to the appropriate topics. This allows Kafka consumers to subscribe to enriched topics and bypass a cleaning stage. For many cyber analytics applications, this processing realizes a 1000X improvement in cyber operations per watt, based on research published by DOE Sandia and Lewis Rhodes Labs.
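A simplified sketch of this routing step: each captured payload is matched against a small table of regular expressions and published to the corresponding Kafka topic. The patterns, topic names and broker address are illustrative assumptions; the FPGA data path applies PCRE-style matching at line rate.

import re
from kafka import KafkaProducer

ROUTES = [
    (re.compile(rb"^(GET|POST|PUT|DELETE) "), "topic-http"),
    (re.compile(rb"^\x16\x03"),               "topic-tls"),   # TLS handshake record
]
DEFAULT_TOPIC = "topic-unclassified"

producer = KafkaProducer(bootstrap_servers="localhost:9092")

def route(packet_payload: bytes) -> None:
    # Publish the payload to the first topic whose pattern matches.
    for pattern, topic in ROUTES:
        if pattern.search(packet_payload):
            producer.send(topic, packet_payload)
            return
    producer.send(DEFAULT_TOPIC, packet_payload)

route(b"GET /metrics HTTP/1.1\r\n")
producer.flush()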

Nallatech 385A Cloudera/Intel Example

The Nallatech 385A provides two network ports supporting up to 40GbE each. This NIC-sized card can replace an existing NIC/CPU combination to significantly accelerate existing Kafka networks and reduce power.

This has been verified by Cloudera and Intel to accelerate Kafka to Spark streaming, whilst performing data enrichment on the FPGA (Figure 9).

Figure 9 Enriched data using 385A

In the above demonstration, we have chosen engine noise signatures as our input data stream. They are ingested via a UDP offload engine and placed into the card's OpenCL environment. OpenCL code running on the card performs real-time formatting on the incoming data stream. It then performs an FFT, extracts features and classifies the signal as "normal" or "abnormal" based on comparison with known engine signatures. This classification, along with the FFT of the engine signals, is DMA'd into Kafka for further processing.
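A CPU-side sketch of that processing chain, using NumPy: window the samples, take an FFT, extract a simple spectral feature and compare it against a reference signature. The window size, sample rate, feature and threshold below are illustrative assumptions, not values from the demonstration.

import numpy as np

WINDOW = 1024
SAMPLE_RATE_HZ = 8000.0
REFERENCE_PEAK_HZ = 120.0     # assumed dominant frequency of a healthy engine
TOLERANCE_HZ = 15.0

def classify(frame: np.ndarray) -> str:
    # Window the frame, take the FFT and use the dominant frequency as the feature.
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(WINDOW)))
    freqs = np.fft.rfftfreq(WINDOW, d=1.0 / SAMPLE_RATE_HZ)
    peak_hz = freqs[np.argmax(spectrum)]
    return "normal" if abs(peak_hz - REFERENCE_PEAK_HZ) < TOLERANCE_HZ else "abnormal"

t = np.arange(WINDOW) / SAMPLE_RATE_HZ
print(classify(np.sin(2 * np.pi * 120.0 * t)))   # "normal"
print(classify(np.sin(2 * np.pi * 400.0 * t)))   # "abnormal"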

This example also highlights the flexibility of OpenCL-generated libraries which can be applied to incoming streaming data. This offers the end user immense latitude to include very application-specific forms of data enrichment or data filtering.

520N: 100GbE with Stratix 10

The Nallatech 520N's four network ports support an array of serial I/O protocols operating at 10/25/40/100GbE. With a total throughput of up to 400Gb/s, the 520N is capable of enriching high volumes of data prior to offloading to a Kafka framework.

Figure 10 Enriched data using 520N

The 520N is populated with the powerful Stratix 10 FPGA offering unparalleled performance.
With the combination of high throughput, large amounts of compute and programmability using OpenCL, it is possible to perform complex data enrichment on streaming data on a single device.

More Information and How to Evaluate

Nallatech along with Intel PSG are experts at Kafka acceleration. Nallatech has current and planned products to accelerate Apache Kafka using Arria 10 and Stratix 10 FPGAs. Please contact Nallatech to discuss your needs and develop an accelerated solution.

View All Nallatech FPGA Cards

FPGA Accelerated Compute Node
FACN

FPGA Accelerated Compute Node – with up to (4) 520s

520 – with Stratix 10 FPGA
520

Compute Accelerator Card
w/Stratix 10 FPGA

510T - Compute Accelerator with Arria 10 FPGA
510T

 Nallatech 510T
w/(2) Arria 10  FPGAs

385A - Network Accelerator with Arria 10 FPGA
385A

 Nallatech 385A – w/Arria10 / GX1150 FPGA


FPGA Acceleration of Convolutional Neural Networks

Nallatech Whitepaper – FPGA Accelerated CNN

Introduction – CNN – Convolutional Neural Network

Convolutional Neural Networks (CNNs) have been shown to be extremely effective at complex image recognition problems. This white paper discusses how these networks can be accelerated using FPGA accelerator products from Nallatech, programmed using the Altera OpenCL Software Development Kit. It then describes how image categorization performance can be significantly improved by reducing computation precision: each reduction in precision allows the FPGA accelerator to process more images per second.

Caffe Integration

Caffe is a deep learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center and by community contributors.

The Caffe framework uses a text-based protobuf (prototxt) network description to define the different processing layers required for a particular CNN. By implementing different combinations of layers, a user is able to quickly create a new network topology for their given requirements.

The most commonly used of these layers are:
• Convolution: The convolution layer convolves the input image with a set of learnable filters, each producing one feature map in the output image.
• Pooling: Max-pooling partitions the input image into a set of non-overlapping rectangles and, for each such sub-region, outputs the maximum value.
• Rectified-Linear: Given an input value x, the ReLU layer computes the output as x if x > 0 and negative_slope * x if x <= 0.
• InnerProduct/Fully Connected: The image is treated as a single vector, with each point contributing to each point of the new output vector.

By porting these 4 layers to the FPGA, the vast majority of forward processing networks can be implemented on the FPGA using the Caffe framework.
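For reference, the arithmetic of these four layers can be written down in a few lines of NumPy. This is only a CPU sketch of the layer definitions above, with illustrative shapes; the FPGA kernels implement the same operations as streaming OpenCL pipelines.

import numpy as np

def relu(x, negative_slope=0.0):
    # Rectified-Linear: x if x > 0, else negative_slope * x
    return np.where(x > 0, x, negative_slope * x)

def max_pool_2x2(x):
    # Max-pooling over non-overlapping 2x2 windows; x is (H, W) with even H, W.
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def convolve2d_valid(image, kernel):
    # Convolution of one feature map with one filter, "valid" region only
    # (implemented as cross-correlation, as in most CNN frameworks).
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def fully_connected(x, weights, bias):
    # InnerProduct: every input element contributes to every output element.
    return weights @ x.ravel() + bias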

Figure 1 : Example illustration of a typical CNN – Convolutional Neural Network

To access the accelerated FPGA version of the code, the user need only change the description of the CNN layer in the Caffe network description file to target the FPGA equivalent.

AlexNet

Figure 2 : AlexNet CNN – Convolutional Neural Network

AlexNet is a well known and widely used network, with freely available trained datasets and benchmarks. This paper discusses an FPGA implementation targeted at the AlexNet CNN; however, the approach used here applies equally well to other networks.

Figure 2 illustrates the different network layers required by the AlexNet CNN. There are 5 convolution layers and 3 fully connected layers, and together these layers occupy more than 99% of the processing time for this network. The convolution layers use 3 different filter sizes: 11×11, 5×5 and 3×3. Creating a separately optimized engine for each convolution layer would be inefficient, because the computational time of each layer differs depending upon the number of filters applied, the size of the input images and the number of input and output features processed. By increasing the resource applied to the more compute-intensive layers, each layer can be balanced to complete in the same amount of time. It is therefore possible to create a pipelined process that can have several images in flight at any one time, maximizing the efficiency of the logic used, i.e. most processing elements are busy most of the time.

Table 1 : ImageNet layer computation requirements

Table 1 shows the computation required for each layer of the ImageNet network. From this table it can be seen that the 5×5 convolution layer requires more compute than the other layers; therefore, more FPGA processing logic must be allocated to this layer for it to remain balanced with the other layers.

The inner product layers have an n-to-n mapping, requiring a unique coefficient for each multiply-add. Inner product layers usually require significantly less compute than convolutional layers and therefore require less parallelization of logic. In this scenario it makes sense to move the inner product layers onto the host CPU, leaving the FPGA to focus on the convolutions.

FPGA logic areas

FPGA devices have two processing regions: DSP and ALU logic. The DSP logic is dedicated to multiply or multiply-add operators, because using ALU logic for large (18×18 bit) floating point multiplications is costly. Given how common multiplications are in DSP operations, FPGA vendors provide dedicated logic for this purpose. Altera has gone a step further and allows the DSP logic to be reconfigured to perform floating point operations. To increase performance for CNN processing it is necessary to increase the number of multiplications that can be implemented in the FPGA. One approach is to decrease the bit accuracy.

Bit Accuracy

Most CNN implementations use floating point precision for the different layer calculations. For a CPU or GPGPU implementation this is not an issue, as the floating point IP is a fixed part of the chip architecture. For FPGAs the logic elements are not fixed. The Arria 10 and Stratix 10 devices from Altera have embedded floating point DSP blocks that can also be used for fixed point multiplications. Each DSP component can in fact be used as two separate 18×19 bit multipliers. By performing convolution with 18 bit fixed point logic, the number of available operators doubles compared to single precision floating point.

Figure 3 : Arria 10 floating point DSP configuration

If reduced precision floating point processing is acceptable, it is possible to use half precision. This requires additional logic from the FPGA fabric, but doubles the number of floating point calculations possible, assuming the lower bit precision is still adequate.

One of the key advantages of the pipeline approach described in this white paper is the ability to vary the accuracy at different stages of the pipeline. Resources are therefore only used where necessary, increasing the efficiency of the design.

Figure 4 : Arria 10 fixed point DSP configuration

Depending upon the application's tolerance, the bit precision can be reduced further still. If the bit width of the multiplications can be reduced to 10 bits or fewer (20 bit output), the multiplication can be performed efficiently using just the FPGA ALU logic, doubling the number of multiplications possible compared to using the FPGA DSP logic alone. Some networks may be tolerant of even lower bit precision; the FPGA can handle all precisions down to a single bit if necessary.

For the CNN layers used by AlexNet it was ascertained that 10 bit coefficient data was the minimum width that could be used in a simple fixed point implementation whilst maintaining less than a 1% error versus a single precision floating point implementation.
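A quick way to see the effect of such a reduction is to quantize a set of single precision coefficients onto a fixed point grid and compare against the float reference. The sketch below assumes a simple symmetric quantizer and a randomly generated filter; it is illustrative only, not the method used to derive the 10 bit figure above.

import numpy as np

def quantize_fixed(x, bits=10):
    # Symmetric signed fixed point: one sign bit, (bits - 1) magnitude bits.
    scale = (2 ** (bits - 1) - 1) / np.max(np.abs(x))
    return np.round(x * scale) / scale

rng = np.random.default_rng(0)
coeffs = rng.standard_normal(11 * 11 * 3).astype(np.float32)   # e.g. one 11x11x3 filter
quantized = quantize_fixed(coeffs, bits=10)
rel_error = np.abs(quantized - coeffs).mean() / np.abs(coeffs).mean()
print(f"mean relative error: {rel_error:.3%}")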

CNN convolution layers

Using a sliding window technique, it is possible to create convolution kernels that are extremely light on memory bandwidth.

Figure 5 : Sliding window for 3×3 convolution

Figure 5 illustrates how data is cached in FPGA memory allowing each pixel to be reused multiple times. The amount of data reuse is proportional to the size of the convolution kernel.
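The sketch below models this sliding-window reuse on the CPU: the most recent rows are held in a small line buffer so that each pixel is read from external memory only once. Image size and kernel are illustrative; on the FPGA the window sits in registers and the rows in on-chip memory.

import numpy as np
from collections import deque

def stream_convolve_3x3(rows, width, kernel):
    # rows: an iterable of 1-D arrays of length `width`, arriving one at a time.
    line_buffer = deque(maxlen=3)            # the three most recently seen rows
    for row in rows:
        line_buffer.append(np.asarray(row, dtype=float))
        if len(line_buffer) < 3:
            continue                         # not enough rows buffered yet
        window_rows = np.vstack(line_buffer)     # shape (3, width)
        out_row = np.empty(width - 2)
        for x in range(width - 2):
            out_row[x] = np.sum(window_rows[:, x:x + 3] * kernel)
        yield out_row

kernel = np.ones((3, 3)) / 9.0               # illustrative box filter
image = np.arange(5 * 8, dtype=float).reshape(5, 8)
for out_row in stream_convolve_3x3(image, width=8, kernel=kernel):
    print(out_row)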

As each input layer influences all output layers of a CNN convolution layer, it is possible to process multiple input layers simultaneously. This would increase the external memory bandwidth required for loading layers. To mitigate the increase, all data except the coefficients is stored in local M20K memory on the FPGA device. The amount of on-chip memory on the device limits the number of CNN layers that can be implemented.

Figure 6 : OpenCL Global Memory Bandwidth (AlexNet)

Most CNN features will fit within a single M20K memory, and with thousands of M20Ks embedded in the FPGA fabric, the total memory bandwidth available for processing convolution features in parallel is of the order of tens of terabytes per second.

Figure 7 : Arria 10 GX1150 / Stratix 10 GX2800 resources

Depending upon the amount of M20K resource available, it is not always possible to fit a complete network on a single FPGA. In this situation, multiple FPGAs can be connected in series using high speed serial interconnects, allowing the network pipeline to be extended until sufficient resource is available.
A key advantage of this approach is that it does not rely on batching to maximize performance; the latency is therefore very low, which is important for latency-critical applications.

Figure 8 : Extending a CNN Network Over Multiple FPGAs

Balancing the layers so that each takes the same amount of time requires adjusting the number of parallel input layers implemented and the number of pixels processed in parallel.

Figure 9 : Resources for 5×5 convolution layer of AlexNet

Figure 9 lists the resources required for the 5×5 convolution layer of AlexNet with 48 parallel kernels, for both a single precision and a 16 bit fixed point version on an Intel Arria 10 FPGA. The numbers include the OpenCL board logic, but illustrate the benefit lower precision has on resource usage.

Fully Connected Layer
Processing of a fully connected layer requires a unique coefficient for each element and therefore quickly becomes memory bound with increasing parallelism. The amount of parallelism required to keep pace with the convolutional layers would quickly saturate the FPGA's off-chip memory; it is therefore proposed that at this stage the input layers are either batched or pruned.

As the number of elements in an inner product layer is small, the amount of storage required for batching is small compared to the storage required for the convolution layers. Batching then allows the same coefficient to be reused for each batched layer, reducing the external memory bandwidth.

Pruning works by studying the input data and ignoring values below a threshold. As fully connected layers are placed at the later stages of a CNN network, many possible features have already been eliminated. Therefore, pruning can significantly reduce the amount of work required.
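A minimal sketch of the pruning idea, assuming a simple magnitude threshold on the activations entering the fully connected layer: only the surviving inputs, and the matching columns of the weight matrix, take part in the multiply-adds. Threshold, sizes and data are illustrative.

import numpy as np

def pruned_fc(activations, weights, bias, threshold=1e-2):
    # weights: (out_features, in_features); keep only inputs above the threshold.
    keep = np.abs(activations) > threshold
    return weights[:, keep] @ activations[keep] + bias

rng = np.random.default_rng(1)
x = rng.standard_normal(4096) * (rng.random(4096) > 0.8)   # mostly-zero activations
W = rng.standard_normal((1000, 4096))
b = np.zeros(1000)

dense = W @ x + b
approx = pruned_fc(x, W, b)
print("inputs kept:", int(np.count_nonzero(np.abs(x) > 1e-2)), "of", x.size)
print("max abs deviation from dense result:", float(np.abs(approx - dense).max()))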

Resource
The key resource driver of the network is the amount of on-chip M20K memory available to store the outputs of each layer. This is constant and independent of the amount of parallelism achieved. Extending the network over multiple FPGAs increases the total amount of M20K memory available and therefore the depth of CNN that can be processed.

Conclusion
The unique flexibility of the FPGA fabric allows the logic precision to be adjusted to the minimum that a particular network design requires. By limiting the bit precision of the CNN calculation the number of images that can be processed per second can be significantly increased, improving performance and reducing power.

The non-batching approach of the FPGA implementation allows single-frame latency for object recognition, which is ideal for situations where low latency is crucial, e.g. object avoidance.

Using this approach for AlexNet (single precision for layer 1, then using 16 bit fixed for remaining layers), each image can be processed in ~1.2 milliseconds with a single Arria 10 FPGA, or 0.58 milliseconds with two FPGAs in series.



Nallatech exhibiting at SuperComputing 17

Nallatech Showcases Next Generation FPGA Accelerators at Supercomputing 2017

Leaders in FPGA Acceleration
Visit booth 1362 for Machine Learning and Kafka Data Ingest case studies using the latest generation of FPGA accelerators and tools

LISLE, IL – November 13, 2017 – Nallatech, a Molex company, will showcase FPGA solutions for high-performance computing (HPC), low latency network acceleration and data analytics at the Supercomputing 2017 (SC17) Conference and Exhibition, November 13-16 in Denver, Colorado.

FPGA Acceleration Card with Stratix 10 FPGA
“FPGAs are being deployed in volume across a range of on-premise platforms and cloud infrastructure to achieve a step-change in application performance and energy-efficiency above and beyond what can be achieved using conventional processor technologies,” said Craig Petrie, VP Business Development of FPGA Solutions at Nallatech. “We’re excited to be showcasing our new OpenCL-programmable ‘520’ product range featuring Intel Stratix-10 FPGAs. These server-qualified accelerator products have been engineered to cost-effectively solve demanding co-processing and real-time data ingest and enrichment applications.”

Nallatech will present two example applications featuring the latest hardware and tools, where FPGAs demonstrate significant value to customers:

Convolutional Neural Networks (CNN) – Object classification using a low profile Nallatech 385A™ PCIe accelerator card with a discrete Intel Arria 10 FPGA accelerator programmed using Intel’s OpenCL Software Development Kit. Built on the BVLC Caffe deep learning framework, an FPGA interface and IP accelerate processing intensive components of the algorithm. Nallatech IP is capable of processing an image through the AlexNet neural network in nine milliseconds. The Arria10-based 385A™ board has the capacity to process six CNN images in parallel allowing classification of 660 images per second.

KAFKA Ingest/Egress – Acceleration of KAFKA Producers using the advanced capabilities of Intel’s new Stratix-10 FPGA silicon and OpenCL Software Development Kit (SDK). This case study describes an analytic framework that provides up to 40 times increase in ingest performance enabling real-time data filtering and enrichment.

Additionally, Nallatech will display a range of leading-edge technologies at SC17 including:

520N™ Network Accelerator Card — A GPU/Phi-sized 16-lane PCIe Gen 3 card sporting four 100G network ports directly coupled to an Intel Stratix-10 FPGA. Four independent banks of DDR4 memory complete the balanced architecture capable of handling latency-critical 100G streaming applications.

520C™ Compute Acceleration Card – A GPU/Phi-sized 16-lane PCIe Gen 3 card, the OpenCL-programmable 520C™ features an Intel Stratix-10 FPGA designed to deliver ultimate performance per watt for compute-intensive HPC workloads.

About Nallatech:
Nallatech, a Molex company, is a leading supplier of accelerated computing solutions. Nallatech has deployed several of the world’s largest FPGA hybrid compute clusters, and is focused on delivering scalable solutions that deliver high performance per watt, per dollar. www.nallatech.com.

About Molex, LLC
Molex brings together innovation and technology to deliver electronic solutions to customers worldwide. With a presence in more than 40 countries, Molex offers a full suite of solutions and services for many markets, including data communications, consumer electronics, industrial, automotive, commercial vehicle and medical. For more information, please visit http://www.molex.com.


OpenCapi Blog: Post 1

Datacentric Architectures

Molex/Nallatech Leverages OpenCAPI
for 200GBytes/s of Hyperconverged
NVMe Storage Bandwidth
By Allan Cantle

Over the last decade the computing industry has managed to deliver application performance improvements and better energy-efficiency for its customers by embracing parallelism, co-processor type acceleration and techniques to bypass and unburden the CPU. These have worked on the premise of maintaining the CPU centric nature of the server while effectively adding data centric enhancements.

To maintain this rate of incremental improvement, the industry is now embracing many more system level enhancements to the fundamental computing architecture, and the CPU is becoming an important member of a fundamentally data centric architecture rather than being at the heart of that architecture. With this architectural shift the network fabric is becoming the critical piece at the center, as evidenced by the plethora of new fabric standards including Omnipath, NVLink, OpenCAPI, GenZ, CCIX and Infinity Fabric, to name a few. Each of these fabrics claims to solve either a piece of, or all of, the communication requirements for future data centric architectures.

OpenCAPI is enjoying the early mover advantage as an excellent open standard conduit, both metaphorically and physically, in facilitating this data centric industry shift. This becomes even more important when you realize that the industry cannot leave behind CPU centric legacy software that will need to continue running for many decades to come.

It is critical to understand that OpenCAPI is singularly focused on being the best coherent, low latency and high bandwidth (25GBytes/s Tx & 25GBytes/s Rx) interconnect for the hyperconvergence of data centric architectural pieces within a node. Consequently, it is looking for a complementary fabric to support the ingress and egress of data to and from the node; this will be a topic for a later blog. OpenCAPI based hyperconverged solutions must also become more programmable, in a similar vein to those developed earlier on CAPI, such as the CAPI SNAP (Storage Networking & Acceleration Programming) framework.

Nallatech is a pioneer of data centric computing using FPGAs, where computational functions are built around flowing data streams. It has 24 years of experience in successfully helping customers to migrate and deploy data centric heterogeneous architectures featuring FPGA technology. OpenCAPI was designed to leverage the strengths of FPGA architectures and minimize the impact of their weaknesses. Figure 1 shows a block diagram of Nallatech’s perspective of how the OpenCAPI bus is at the heart of enabling the true emergence of data centric architectures.


Figure 1 OpenCAPI enabling Data Centric architectures through a Hyperconverged & Disaggregatable Architecture

Critical to this industry transformation is the open collaboration of all the industry's experts with their differing skillsets. This openness, especially at the interface level, will help to ensure that the best ideas win out and that everyone can innovate around these new standards to deliver the best solutions to the industry's customer base, including the essential software infrastructure stacks that will make this technology easily accessible to application developers.

With Nallatech's data centric heritage, Molex and Nallatech bring decades of experience in tackling complex data centric problems, from HPDA applications such as video analytics and AI to classical memory bound HPC problems like seismic migration algorithms. These new system level solutions, based around OpenCAPI, will deliver over 5x performance gains at power levels that realistically begin to approach the DOE's 20MW Exascale target.

Additionally Nallatech will leverage OpenCAPI to ensure that valuable memory resources can be effectively shared with the CPU without breaking the essential support of the legacy CPU centric code base.

Come by the OpenCAPI, Molex & Nallatech booths #1587-#1589, #1263 & #1362 where we will be showcasing how our Sawmill FSA (Flash Storage Accelerator) development platform brings up to 200GBytes/s of hyperconverged accelerated storage to the Google/Rackspace Zaius/Barreleye-G2 POWER9 OCP Platform. The Sawmill FSA is designed to natively support the benefits of OpenCAPI by providing the lowest possible latency and highest bandwidth to NVMe Storage with the added benefits of OpenCAPI Flash functionality and near storage FPGA acceleration. HPDA applications such as graph analytics, in-memory databases and bioinformatics are expected to benefit greatly from this platform.



Nallatech exhibiting at International SuperComputing 17

Nallatech, a Molex company, will showcase next generation OpenCL-programmable FPGA accelerator products for datacentre and cloud service applications at ISC17 being held in Frankfurt, Germany, June 19-23, 2017. The annual exhibition represents one of the largest gatherings of high performance computing (HPC) industry leaders and experts displaying the latest innovations.
 
ISC17 Announcement
Nallatech – ISC17 Booth C-1250 will feature hardware and software products plus design services for customers building scale-out datacentres and cloud-based services leveraging FPGA technology.
“International Supercomputing is the perfect event for Nallatech to introduce the “520” – our next generation energy-efficient accelerator product featuring Intel Stratix 10 FPGAs,” said Craig Petrie, VP Business Development FPGA Solutions, Nallatech. “The OpenCL-programmable 520 delivers twice the core performance over previous-generation FPGAs with up to 70% lower power consumption. This unprecedented price-performance coupled with Nallatech’s application expertise and extensive manufacturing capabilities allows our customers to benchmark and deploy large-scale FPGA solutions with minimal cost and risk.”
Advancements in architecture and high-level programming tools are opening doors for new FPGA use cases. For more information on how Nallatech streamlines FPGA integration and supports customers in the transition from prototyping to production, please visit www.nallatech.com
About Nallatech
Nallatech is a leading supplier of FPGA accelerated computing solutions. Since 1993, Nallatech has provided hardware, software and design services to enable customers' success in applications including high performance computing, network processing, and real-time embedded computing.

About ISC17
ISC17 High Performance exhibition features the largest collection of HPC vendors, universities, and research organizations annually assembled in Europe. Together, they represent a level of innovation, diversity and creativity that are the hallmarks of the global HPC community. Having them all available on the same exhibition floor presents a unique opportunity for users to survey the HPC landscape and for vendors to display their latest and greatest wares.

Nallatech Officially Joins Dell Technology Partner Program

Nallatech Joins the Dell Technology Partner Program

Nallatech and Dell partner to offer High-Performance Computing in the Datacenter


CAMARILLO, CA – March 11, 2017 – Nallatech, a Molex company, recently announced its official membership in Dell's Technology Partner Program. This new partnership will help accelerate the datacenter more efficiently than ever before.

Nallatech will continue to integrate FPGA Accelerators in Dell Servers, but this new partnership will help sway those on the fence that have not completely bought in to this model of computing. FPGA experts and newcomers will be empowered to utilize this mainstream method of FPGA algorithm development and deployment in the datacenter. With this ready-to-use solution FPGA programmers can focus completely on developing their own massively parallel and compute intensive applications while reducing power consumption and total cost of ownership.

Read more about Nallatech’s new partnership with Dell – Click here

The Dell Technology Partner Program
Nallatech is a Dell Technology Partner. The 385A and 510T FPGA OpenCL Accelerators are certified by Dell to run on Dell platforms that are specified in the above technical overview.

About Nallatech:
Nallatech, a Molex company, is a leading supplier of accelerated computing solutions. Nallatech has deployed several of the world’s largest FPGA hybrid compute clusters, and is focused on delivering scalable solutions that deliver high performance per watt, per dollar. www.nallatech.com.

About Molex, LLC
Molex brings together innovation and technology to deliver electronic solutions to customers worldwide. With a presence in more than 40 countries, Molex offers a full suite of solutions and services for many markets, including data communications, consumer electronics, industrial, automotive, commercial vehicle and medical. For more information, please visit http://www.molex.com.


SuperComputing 2016 – Nallatech Exhibiting

Nallatech Drives FPGA Accelerator Revolution at Supercomputing 2016

Visit booth 2214 for Deep Learning CNN demonstration using latest generation of FPGA accelerators and tools


LISLE, IL – November 14, 2016 – Nallatech, a Molex company, will showcase FPGA solutions for high-performance computing (HPC), network acceleration and data analytics at the Supercomputing 2016 (SC16) Conference and Exhibition, November 14-17 in Salt Lake City.

“Major datacenter customers are now deploying FPGAs to achieve break-through application performance and energy-efficiency above and beyond what can be achieved using conventional CPU and GPU platforms,” said Allan Cantle, President and Founder, Nallatech. “In-depth design, application expertise and process knowledge combined with extensive manufacturing capabilities allow for quick deployments of large-scale FPGA solutions.”

A demonstration at the Nallatech booth 2214 will showcase FPGA acceleration of Convolutional Neural Networks (CNN) object classification by using a low profile Nallatech 385A™ PCIe accelerator card with a discrete Intel Arria 10 FPGA accelerator programmed using Intel’s OpenCL Software Development Kit. Built on the BVLC Caffe deep learning framework, an FPGA interface and IP accelerate processing intensive components of the algorithm. Nallatech IP is capable of processing an image through the AlexNet neural network in nine milliseconds. The Arria10-based 385A™ board has the capacity to process six CNN images in parallel allowing classification of 660 images per second.

Additionally, Nallatech will display a range of leading-edge technologies at SC16—

  • 385A™ FPGA Accelerator Card — A low profile, server-qualified FPGA card capable of accelerating energy-efficient datacenter applications. Two independent banks of SDRAM memory and two optical network ports complete the balanced architecture capable of both co-processing and latency-critical 1G/10G/40G streaming applications.
  • 385A-SoC™ System on Chip FPGA Accelerator Card – A powerful computing and I/O platform for SoC FPGA and ARM-based development and deployment across a range of application areas including HPC, image processing and network analytics. The 385A-SoC™ is capable of being used in “stand-alone” mode running Linux and software stacks on the embedded ARM processors allowing the accelerator to be used without a host server for ultimate Size, Weight and Power (SWAP) performance.
  • 510T™ Compute Acceleration Card – FPGA co-processor designed to deliver ultimate performance per watt for compute-intensive datacenter applications. A GPU-sized 16-lane PCIe Gen 3 card, the 510T™ features two Intel Arria 10 FPGAs delivering up to sixteen times the performance of the previous generation. Applications can achieve a total sustained performance of up to 3 TFlops. The 510T™ card is available with almost 300GByte/sec of peak external memory bandwidth configured as eight independent banks of DDR4. This combination, plus the FPGA’s on-chip memory bandwidth of 14.4TBytes/sec, permits dramatic new levels of performance per watt for memory-bound applications.
  • Nallatech® FPGA Accelerated Compute Node® – Develop and deploy quickly with minimal risk using a Nallatech FPGA accelerator, tools and IP delivered pre-integrated in a server of your choice.

About Nallatech:
Nallatech, a Molex company, is a leading supplier of accelerated computing solutions. Nallatech has deployed several of the world’s largest FPGA hybrid compute clusters, and is focused on delivering scalable solutions that deliver high performance per watt, per dollar. www.nallatech.com.

About Molex, LLC
Molex brings together innovation and technology to deliver electronic solutions to customers worldwide. With a presence in more than 40 countries, Molex offers a full suite of solutions and services for many markets, including data communications, consumer electronics, industrial, automotive, commercial vehicle and medical. For more information, please visit http://www.molex.com.


Microsoft Bets its Future on a Reprogrammable Computer Chip

Prototype hardware for Project Catapult, a six-year effort to rebuild Microsoft’s online empire for the next wave of AI—and more. CLAYTON COTTERELL FOR WIRED

IT WAS DECEMBER 2012, and Doug Burger was standing in front of Steve Ballmer, trying to predict the future.

Ballmer, the big, bald, boisterous CEO of Microsoft, sat in the lecture room on the ground floor of Building 99, home base for the company’s blue-sky R&D lab just outside Seattle. The tables curved around the outside of the room in a U-shape, and Ballmer was surrounded by his top lieutenants, his laptop open. Burger, a computer chip researcher who had joined the company four years earlier, was pitching a new idea to the execs. He called it Project Catapult.

 

Doug Burger. CLAYTON COTTERELL FOR WIRED

The tech world, Burger explained, was moving into a new orbit. In the future, a few giant Internet companies would operate a few giant Internet services so complex and so different from what came before that these companies would have to build a whole new architecture to run them. They would create not just the software driving these services, but the hardware, including servers and networking gear. Project Catapult would equip all of Microsoft’s servers—millions of them—with specialized chips that the company could reprogram for particular tasks.

But before Burger could even get to the part about the chips, Ballmer looked up from his laptop. When he visited Microsoft Research, Ballmer said, he expected updates on R&D, not a strategy briefing. “He just started grilling me,” Burger says. Microsoft had spent 40 years building PC software like Windows, Word, and Excel. It was only just finding its feet on the Internet. And it certainly didn’t have the tools and the engineers needed to program computer chips—a task that’s difficult, time consuming, expensive, and kind of weird. Microsoft programming computer chips was like Coca Cola making shark fin soup.

The current incarnation of Project Catapult. CLAYTON COTTERELL FOR WIRED

Burger—trim, only slightly bald, and calmly analytical, like so many good engineers—pushed back. He told Ballmer that companies like Google and Amazon were already moving in this direction. He said the world’s hardware makers wouldn’t provide what Microsoft needed to run its online services. He said that Microsoft would fall behind if it didn’t build its own hardware. Ballmer wasn’t buying it. But after a while, another voice joined the discussion. This was Qi Lu, who runs Bing, Microsoft’s search engine. Lu’s team had been talking to Burger about reprogrammable computer chips for almost two years. Project Catapult was more than possible, Lu said: His team had already started.

Today, the programmable computer chips that Burger and Lu believed would transform the world—called field programmable gate arrays—are here. FPGAs already underpin Bing, and in the coming weeks, they will drive new search algorithms based on deep neural networks—artificial intelligence modeled on the structure of the human brain—executing this AI several orders of magnitude faster than ordinary chips could. As in, 23 milliseconds instead of four seconds of nothing on your screen. FPGAs also drive Azure, the company’s cloud computing service. And in the coming years, almost every new Microsoft server will include an FPGA. That’s millions of machines across the globe. “This gives us massive capacity and enormous flexibility, and the economics work,” Burger says. “This is now Microsoft’s standard, worldwide architecture.”

Catapult team members Adrian Caulfield, Eric Chung, Doug Burger, and Andrew Putnam. CLAYTON COTTERELL FOR WIRED

This isn’t just Bing playing catch-up with Google. Project Catapult signals a change in how global systems will operate in the future. From Amazon in the US to Baidu in China, all the Internet giants are supplementing their standard server chips—central processing units, or CPUs—with alternative silicon that can keep pace with the rapid changes in AI. Microsoft now spends between $5 and $6 billion a year for the hardware needed to run its online empire. So this kind of work is “no longer just research,” says Satya Nadella, who took over as Microsoft’s CEO in 2014. “It’s an essential priority.” That’s what Burger was trying to explain in Building 99. And it’s what drove him and his team to overcome years of setbacks, redesigns, and institutional entropy to deliver a new kind of global supercomputer.

A Brand New, Very Old Kind of Computer Chip
In December of 2010, Microsoft researcher Andrew Putnam had left Seattle for the holidays and returned home to Colorado Springs. Two days before Christmas, he still hadn’t started shopping. As he drove to the mall, his phone rang. It was Burger, his boss. Burger was going to meet with Bing execs right after the holiday, and he needed a design for hardware that could run Bing’s machine learning algorithms on FPGAs.

Putnam pulled into the nearest Starbucks and drew up the plans. It took him about five hours, and he still had time for shopping.

Burger, 47, and Putnam, 39, are both former academics. Burger spent nine years as a professor of computer science at the University of Texas, Austin, where he specialized in microprocessors and designed a new kind of chip called EDGE. Putnam had worked for five years as a researcher at the University of Washington, where he experimented with FPGAs, programmable chips that had been around for decades but were mostly used as a way of prototyping other processors. Burger brought Putnam to Microsoft in 2009, where they started exploring the idea that these chips could actually accelerate online services.

Project Catapult Version 1, or V1, the hardware that Doug Burger and team tested in a data center on Microsoft’s Seattle campus. CLAYTON COTTERELL FOR WIRED

Even their boss didn’t buy it. “Every two years, FPGAs are ‘finally going to arrive,’” says Microsoft Research vice president Peter Lee, who oversees Burger’s group. “So, like any reasonable person, I kind of rolled my eyes when this was pitched.” But Burger and his team believed this old idea’s time had come, and Bing was the perfect test case.

Microsoft’s search engine is a single online service that runs across thousands of machines. Each machine is driven by a CPU, and though companies like Intel continue to improve them, these chips aren’t keeping pace with advances in software, in large part because of the new wave in artificial intelligence. Services like Bing have outstripped Moore’s Law, the canonical notion that the number of transistors in a processor doubles every 18 months. Turns out, you can’t just throw more CPUs at the problem.

But on the other hand, it’s generally too expensive to create specialized, purpose-built chips for every new problem. FPGAs bridge the gap. They let engineers build chips that are faster and less energy-hungry than an assembly-line, general-purpose CPU, but customizable so they handle the new problems of ever-shifting technologies and business models.

At that post-holiday meeting, Burger pitched Bing’s execs on FPGAs as a low-power way of accelerating searches. The execs were noncommittal. So over the next several months, Burger and team took Putnam’s Christmas sketch and built a prototype, showing that it could run Bing’s machine learning algorithms about 100 times faster. “That’s when they really got interested,” says Jim Larus, another member of the team back then who’s now a dean at Switzerland’s École Polytechnique Fédérale in Lausanne. “They also started giving us a really hard time.”

The prototype was a dedicated box with six FPGAs, shared by a rack full of servers. If the box went on the fritz, or if the machines needed more than six FPGAs—increasingly likely given the complexity of the machine learning models—all those machines were out of luck. Bing’s engineers hated it. “They were right,” Larus says.

So Burger’s team spent many more months building a second prototype. This one was a circuit board that plugged into each server and included only one FPGA. But it also connected to all the other FPGA boards on all the other servers, creating a giant pool of programmable chips that any Bing machine could tap into.

That was the prototype that got Qi Lu on board. He gave Burger the money to build and test over 1,600 servers equipped with FPGAs. The team spent six months building the hardware with help from manufacturers in China and Taiwan, and they installed the first rack in an experimental data center on the Microsoft campus. Then, one night, the fire suppression system went off by accident. They spent three days getting the rack back in shape—but it still worked.

Over several months in 2013 and 2014, the test showed that Bing’s “decision tree” machine-learning algorithms ran about 40 times faster with the new chips. By the summer of 2014, Microsoft was publicly saying it would soon move this hardware into its live Bing data centers. And then the company put the brakes on.

Searching for More Than Bing
Bing dominated Microsoft’s online ambitions in the early part of the decade, but by 2015 the company had two other massive online services: the business productivity suite Office 365 and the cloud computing service Microsoft Azure. And like all of their competitors, Microsoft executives realized that the only efficient way of running a growing online empire is to run all services on the same foundation. If Project Catapult was going to transform Microsoft, it couldn’t be exclusive to Bing. It had to work inside Azure and Office 365, too.

The problem was, Azure executives didn’t care about accelerating machine learning. They needed help with networking. The traffic bouncing around Azure’s data centers was growing so fast, the service’s CPUs couldn’t keep pace. Eventually, people like Mark Russinovich, the chief architect on Azure, saw that Catapult could help with this too—but not the way it was designed for Bing. His team needed programmable chips right where each server connected to the primary network, so they could process all that traffic before it even got to the server.

The first prototype of the FPGA architecture was a single box shared by a rack of servers (Version 0). Then the team switched to giving individual servers their own FPGAs (Version 1). And then they put the chips between the servers and the overall network (Version 2).

So the FPGA gang had to rebuild the hardware again. With this third prototype, the chips would sit at the edge of each server, plugging directly into the network, while still creating a pool of FPGAs that was available for any machine to tap into. That started to look like something that would work for Office 365, too. Project Catapult was ready to go live at last.

Larus describes the many redesigns as an extended nightmare—not because they had to build new hardware, but because they had to reprogram the FPGAs every time. “That is just horrible, much worse than programming software,” he says. “Much more difficult to write. Much more difficult to get correct.” It’s finicky work, like trying to change tiny logic gates on the chip.

Now that the final hardware is in place, Microsoft faces that same challenge every time it reprograms these chips. “It’s a very different way of seeing the world, of thinking about the world,” Larus says. But the Catapult hardware costs less than 30 percent of everything else in the server, consumes less than 10 percent of the power, and processes data twice as fast as the company could without it.

The rollout is massive. Microsoft Azure uses these programmable chips to route data. On Bing, which handles an estimated 20 percent of the worldwide search market on desktop machines and about 6 percent on mobile phones, the chips are facilitating the move to the new breed of AI: deep neural nets. And according to one Microsoft employee, Office 365 is moving toward using FPGAs for encryption and compression as well as machine learning—for all of its 23.1 million users. Eventually, Burger says, these chips will power all Microsoft services.

Wait—This Actually Works?
“It still stuns me,” says Peter Lee, “that we got the company to do this.” Lee oversees an organization inside Microsoft Research called NExT, short for New Experiences and Technologies. After taking over as CEO, Nadella personally pushed for the creation of this new organization, and it represents a significant shift from the 10-year reign of Ballmer. It aims to foster research that can see the light of day sooner rather than later—that can change the course of Microsoft now rather than years from now. Project Catapult is a prime example. And it is part of a much larger change across the industry. “The leaps ahead,” Burger says, “are coming from non-CPU technologies.”

Peter Lee. CLAYTON COTTERELL FOR WIRED

All the Internet giants, including Microsoft, now supplement their CPUs with graphics processing units, chips designed to render images for games and other highly visual applications. When these companies train their neural networks to, for example, recognize faces in photos—feeding in millions and millions of pictures—GPUs handle much of the calculation. Some giants like Microsoft are also using alternative silicon to execute their neural networks after training. And even though it’s crazily expensive to custom-build chips, Google has gone so far as to design its own processor for executing neural nets, the tensor processing unit.

With its TPUs, Google sacrifices long-term flexibility for speed. It wants to, say, eliminate any delay when recognizing commands spoken into smartphones. The trouble is that if its neural networking models change, Google must build a new chip. But with FPGAs, Microsoft is playing a longer game. Though an FPGA isn’t as fast as Google’s custom build, Microsoft can reprogram the silicon as needs change. The company can reprogram not only for new AI models, but for just about any task. And if one of those designs seems likely to be useful for years to come, Microsoft can always take the FPGA programming and build a dedicated chip.

 

 

 


A newer version of the final hardware, V2, a card that slots into the end of each Microsoft server and connects directly to the network. CLAYTON COTTERELL FOR WIRED

Microsoft’s services are so large, and they use so many FPGAs, that they’re shifting the worldwide chip market. The FPGAs come from a company called Altera, and Intel executive vice president Diane Bryant tells me that Microsoft is why Intel acquired Altera last summer—a deal worth $16.7 billion, the largest acquisition in the history of the largest chipmaker on Earth. By 2020, she says, a third of all servers inside all the major cloud computing companies will include FPGAs.

It’s a typical tangle of tech acronyms. CPUs. GPUs. TPUs. FPGAs. But it’s the subtext that matters. With cloud computing, companies like Microsoft and Google and Amazon are driving so much of the world’s technology that those alternative chips will drive the wider universe of apps and online services. Lee says that Project Catapult will allow Microsoft to continue expanding the powers of its global supercomputer until the year 2030. After that, he says, the company can move toward quantum computing.

Later, when we talk on the phone, Nadella tells me much the same thing. They’re reading from the same Microsoft script, touting a quantum-enabled future of ultrafast computers. Considering how hard it is to build a quantum machine, this seems like a pipe dream. But just a few years ago, so did Project Catapult.

Correction: This story originally implied that the Hololens headset was part of Microsoft’s NExT organization. It was not.

 


FPGA Acceleration of Convolutional Neural Networks

Nalllatech Whitepaper – FPGA Accelerated CNN

Introduction – CNN – Convolutional Neural Network

Convolutional Neural Networks (CNNs) have been shown to be extremely effective at complex image recognition problems. This white paper discusses how these networks can be accelerated using FPGA accelerator products from Nallatech, programmed using the Altera OpenCL Software Development Kit. This paper then describes how image categorization performance can be significantly improved by reducing computation precision. Each reduction in precision allows the FPGA accelerator to process increasingly more images per second.

Caffe Integration

Caffe is a deep learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center and by community contributors.

The Caffe framework uses an XML interface to describe the different processing layers required for a particular CNN. By implementing different combinations of layers a user is able to quickly create a new network topology for their given requirements.

The most commonly used of these layers are:
• Convolution: The convolution layer convolves the input image with a set of learnable filters, each producing one feature map in the output image.
• Pooling: Max-pooling partitions the input image into a set of non-overlapping rectangles and, for each such sub-region, outputs the maximum value.
• Rectified-Linear: Given an input value x, the ReLU layer computes the output as x if x > 0 and negative_slope * x if x <= 0.
• InnerProduct/Fully Connected: The image is treated as a single vector, with each input element contributing to each element of the new output vector.

By porting these four layers to the FPGA, the vast majority of forward-processing networks can be implemented on the FPGA through the Caffe framework.
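
To give a flavour of how simple the per-element arithmetic of two of these layers is, the C sketch below implements the rectified-linear rule quoted above and a 2×2 max-pooling pass over a single feature map. The function names, the stride-2 pooling window and the row-major layout are illustrative assumptions, not details of the Nallatech kernels.

#include <stddef.h>

/* Leaky ReLU as described above: y = x for x > 0, negative_slope * x otherwise.
   (Illustrative sketch only; names and layout are assumptions.) */
static void relu_layer(float *data, size_t n, float negative_slope)
{
    for (size_t i = 0; i < n; i++)
        data[i] = (data[i] > 0.0f) ? data[i] : negative_slope * data[i];
}

/* 2x2, stride-2 max pooling over one w x h feature map stored row-major. */
static void maxpool2x2(const float *in, float *out, size_t w, size_t h)
{
    for (size_t y = 0; y + 1 < h; y += 2) {
        for (size_t x = 0; x + 1 < w; x += 2) {
            float m = in[y * w + x];
            if (in[y * w + x + 1]       > m) m = in[y * w + x + 1];
            if (in[(y + 1) * w + x]     > m) m = in[(y + 1) * w + x];
            if (in[(y + 1) * w + x + 1] > m) m = in[(y + 1) * w + x + 1];
            out[(y / 2) * (w / 2) + (x / 2)] = m;
        }
    }
}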

Figure 1 : Example illustration of a typical CNN – Convolutional Neural Network

To access the accelerated FPGA version of the code, the user need only change the description of the CNN layer in the Caffe prototxt network description file to target the FPGA equivalent.

AlexNet

Figure 2 : AlexNet CNN – Convolutional Neural Network

AlexNet is a well-known and widely used network, with freely available trained datasets and benchmarks. This paper discusses an FPGA implementation targeted at the AlexNet CNN; however, the approach used here applies equally well to other networks.

Figure 2 illustrates the different network layers required by the AlexNet CNN. There are five convolution layers and three fully connected layers; these layers occupy more than 99% of the processing time for this network. There are three different filter sizes across the convolution layers: 11×11, 5×5 and 3×3. Creating a separately optimized kernel for each convolution layer would be inefficient, because the computational load of each layer differs depending upon the number of filters applied, the size of the input images and the number of input and output features processed. Instead, by increasing the resource applied to the more compute-intensive layers, each layer can be balanced to complete in the same amount of time. It is therefore possible to create a pipelined process with several images in flight at any one time, maximizing the efficiency of the logic used, i.e. most processing elements are busy most of the time.

Table 1 : ImageNet layer computation requirements

Table 1 shows the computation required for each layer of the AlexNet (ImageNet) network. From this table it can be seen that the 5×5 convolution layer requires more compute than the other layers. Therefore, more processing logic must be allocated to this layer on the FPGA for it to remain balanced with the other layers.
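
As a rough sketch of this balancing step, the following C fragment allocates parallel processing elements to each convolution layer in proportion to its multiply-accumulate count, so that every pipeline stage takes approximately the same number of cycles. The op counts and the processing-element budget are placeholders, not the actual values from Table 1.

#include <stdio.h>

/* Hypothetical multiply-accumulate counts for the five AlexNet convolution layers
   (placeholders for illustration, not the Table 1 figures). */
static const double layer_macs[5] = { 105e6, 448e6, 150e6, 112e6, 75e6 };

int main(void)
{
    const int total_pe = 256;   /* total parallel multiply-add units budgeted (assumed) */
    double total_macs = 0.0;
    for (int i = 0; i < 5; i++)
        total_macs += layer_macs[i];

    /* Give each layer a share of the processing elements proportional to its work,
       so every pipeline stage finishes in roughly the same number of cycles. */
    for (int i = 0; i < 5; i++) {
        int pe = (int)(total_pe * layer_macs[i] / total_macs + 0.5);
        if (pe < 1) pe = 1;
        printf("conv%d: %3d PEs, ~%.0f cycles per image\n",
               i + 1, pe, layer_macs[i] / pe);
    }
    return 0;
}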

The inner product layers have an n-to-n mapping, requiring a unique coefficient for each multiply-add. Inner product layers usually require significantly less compute than convolution layers and therefore require less parallelization of logic. In this scenario it makes sense to move the inner product layers onto the host CPU, leaving the FPGA to focus on the convolutions.
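
A minimal sketch of what running an inner product layer on the host might look like: a dense matrix-vector multiply in which every output element consumes its own row of coefficients, which is why the layer is memory bound rather than compute bound. The function name and row-major coefficient layout are assumptions for illustration.

#include <stddef.h>

/* Fully connected (inner product) layer on the host CPU:
   out[j] = bias[j] + sum_i W[j][i] * in[i].
   Every multiply-add uses a unique coefficient, so performance is dominated by
   streaming W from memory rather than by arithmetic. (Illustrative sketch.) */
static void inner_product(const float *W, const float *bias,
                          const float *in, float *out,
                          size_t n_in, size_t n_out)
{
    for (size_t j = 0; j < n_out; j++) {
        const float *row = &W[j * n_in];
        float acc = bias[j];
        for (size_t i = 0; i < n_in; i++)
            acc += row[i] * in[i];
        out[j] = acc;
    }
}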

FPGA logic areas

FPGA devices have two types of processing resource: DSP blocks and ALU logic. The DSP blocks are dedicated logic for multiply and multiply-add operators; implementing large (18×18 bit) multiplications in ALU logic is costly. Given how common multiplications are in DSP workloads, FPGA vendors provide dedicated logic for this purpose. Altera has gone a step further and allows the DSP logic to be reconfigured to perform floating point operations. To increase performance for CNN processing it is necessary to increase the number of multiplications that can be implemented in the FPGA. One approach is to decrease the bit accuracy.

Bit Accuracy

Most CNN implementations use floating point precision for the different layer calculations. For a CPU or GPGPU implementation this is not an issue, as the floating point IP is a fixed part of the chip architecture. For FPGAs the logic elements are not fixed. The Arria 10 and Stratix 10 devices from Altera have embedded floating point DSP blocks that can also be used for fixed point multiplications. Each DSP block can in fact be used as two separate 18×19 bit multipliers. By performing the convolutions using 18 bit fixed point logic, the number of available operators doubles compared to single precision floating point.

Figure 3 : Arria 10 floating point DSP configuration

If reduced precision floating point processing is sufficient, it is possible to use half precision. This requires additional logic from the FPGA fabric, but doubles the number of floating point calculations possible, assuming the lower bit precision is still adequate.

One of the key advantages of the pipelined approach described in this white paper is the ability to vary the accuracy at different stages of the pipeline. Resources are therefore only used where necessary, increasing the efficiency of the design.

Figure 4 : Arria 10 fixed point DSP configuration

Depending upon the tolerance of the CNN application, the bit precision can be reduced further still. If the bit width of the multiplications can be reduced to 10 bits or less (20 bit output), the multiplications can be performed efficiently using just the FPGA ALU logic. This doubles the number of multiplications possible compared to using the FPGA DSP logic alone. Some networks may be tolerant to even lower bit precision; the FPGA can handle all precisions down to a single bit if necessary.

For the CNN layers used by AlexNet it was ascertained that 10 bit coefficient data was the lowest precision that could be used in a simple fixed point implementation whilst maintaining less than a 1% error versus a single precision floating point implementation.
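
The kind of experiment behind that figure can be approximated offline in a few lines of C: quantize the trained coefficients to a signed fixed point format of a chosen width and compare a dot product against the single precision reference. The split between total and fractional bits and the error metric below are illustrative assumptions.

#include <math.h>
#include <stdint.h>

/* Quantize a coefficient to a signed fixed point value with 'bits' total bits and
   'frac' fractional bits, then convert back to float to expose the rounding error.
   (Illustrative sketch of the precision study, not the production tool flow.) */
static float quantize(float x, int bits, int frac)
{
    const float scale = (float)(1 << frac);
    int32_t q = (int32_t)lrintf(x * scale);
    const int32_t qmax = (1 << (bits - 1)) - 1;   /* e.g. +511 for 10 bits */
    const int32_t qmin = -(1 << (bits - 1));      /* e.g. -512 for 10 bits */
    if (q > qmax) q = qmax;
    if (q < qmin) q = qmin;
    return (float)q / scale;
}

/* Relative error of a dot product computed with quantized coefficients
   versus the single precision floating point reference. */
static float dot_error(const float *w, const float *x, int n, int bits, int frac)
{
    float ref = 0.0f, fix = 0.0f;
    for (int i = 0; i < n; i++) {
        ref += w[i] * x[i];
        fix += quantize(w[i], bits, frac) * x[i];
    }
    return fabsf(fix - ref) / (fabsf(ref) + 1e-12f);
}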

CNN convolution layers

Using a sliding window technique, it is possible to create convolution kernels that are extremely light on memory bandwidth.

Figure 5 : Sliding window for 3×3 convolution

Figure 5 illustrates how data is cached in FPGA memory, allowing each pixel to be reused multiple times. The amount of data reuse is proportional to the size of the convolution kernel.
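
To make that reuse concrete, the plain C model below keeps the two previous image rows in line buffers, so each pixel is fetched from external memory exactly once and every 3×3 window is assembled from locally held data. In the OpenCL kernel these buffers would map onto on-chip memory; the image width, border handling and function name are illustrative assumptions.

#include <stddef.h>

#define W 224  /* image width (assumed for illustration) */

/* 3x3 convolution over a W-wide image using two line buffers plus a 3x3 window
   register file, so each input pixel is read from external memory exactly once. */
static void conv3x3_sliding(const float *in, float *out, size_t h,
                            const float k[3][3])
{
    static float line0[W], line1[W];   /* previous two rows (on-chip in the FPGA) */
    float win[3][3] = {{0}};           /* 3x3 window registers */

    for (size_t y = 0; y < h; y++) {
        for (size_t x = 0; x < W; x++) {
            float px = in[y * W + x];  /* single external read per pixel */

            /* Shift the window left and insert the new column from the line buffers. */
            for (int r = 0; r < 3; r++) {
                win[r][0] = win[r][1];
                win[r][1] = win[r][2];
            }
            win[0][2] = line0[x];
            win[1][2] = line1[x];
            win[2][2] = px;

            /* Update the line buffers ready for the next row. */
            line0[x] = line1[x];
            line1[x] = px;

            /* Output is valid once the window covers real data (borders ignored). */
            if (y >= 2 && x >= 2) {
                float acc = 0.0f;
                for (int r = 0; r < 3; r++)
                    for (int c = 0; c < 3; c++)
                        acc += win[r][c] * k[r][c];
                out[(y - 1) * W + (x - 1)] = acc;
            }
        }
    }
}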

As each input layer influences all output layers in a CNN convolution layer, it is possible to process multiple input layers simultaneously. This would increase the external memory bandwidth required for loading layers. To mitigate the increase, all data except the coefficients is stored in local M20K memory on the FPGA device. The amount of on-chip memory on the device limits the number of CNN layers that can be implemented.

Figure 6 : OpenCL Global Memory Bandwidth (AlexNet)

Most CNN features will fit within a single M20K memory, and with thousands of M20Ks embedded in the FPGA fabric, the total memory bandwidth available for processing convolution features in parallel is on the order of tens of terabytes per second.

Figure 7 : Arria 10 GX1150 / Stratix 10 GX2800 resources

Depending upon the amount of M20K resource available, it is not always possible to fit a complete network on a single FPGA. In this situation, multiple FPGAs can be connected in series using high speed serial interconnects, allowing the network pipeline to be extended until sufficient resource is available.
A key advantage of this approach is that it does not rely on batching to maximize performance; latency is therefore very low, which is important for latency-critical applications.

Figure 8 : Extending a CNN Network Over Multiple FPGAs

Balancing the layers so that each takes the same amount of time requires adjusting the number of parallel input layers implemented and the number of pixels processed in parallel.

Figure 9: Resources for 5×5 convolution layer of AlexNet

Figure 9 lists the resources required for the 5×5 convolution layer of AlexNet with 48 parallel kernels, for both a single precision and a 16 bit fixed point version on an Intel Arria 10 FPGA. The numbers include the OpenCL board logic, but illustrate the benefit that lower precision has on resource usage.

Fully Connected Layer
Processing of a fully connected layer requires a unique coefficient for each element and therefore quickly becomes memory bound with increasing parallelism. The amount of parallelism required to keep pace with the convolution layers would quickly saturate the FPGA's off-chip memory; it is therefore proposed that the input to this stage is either batched or pruned.

As the number of elements in an inner product layer is small, the amount of storage required for batching is small compared with that required for the convolution layers. Batching then allows the same coefficient to be used for each batched input, reducing the external memory bandwidth.
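
A minimal sketch of the batching idea, assuming a small illustrative batch size: each coefficient row is fetched once and applied to every image in the batch, so the external coefficient traffic no longer grows with the number of images processed.

#include <stddef.h>

#define BATCH 8   /* number of images batched through the fully connected layer (assumed) */

/* Batched inner product: the coefficient row W[j][*] is streamed in once and
   reused for all BATCH input vectors, so external coefficient traffic is
   independent of the batch size. (Illustrative sketch.) */
static void inner_product_batched(const float *W,
                                  const float *in /* [BATCH][n_in] */,
                                  float *out      /* [BATCH][n_out] */,
                                  size_t n_in, size_t n_out)
{
    for (size_t j = 0; j < n_out; j++) {
        const float *row = &W[j * n_in];          /* fetched once per output element */
        float acc[BATCH] = {0};
        for (size_t i = 0; i < n_in; i++) {
            float w = row[i];                     /* one coefficient ... */
            for (int b = 0; b < BATCH; b++)       /* ... reused across the batch */
                acc[b] += w * in[b * n_in + i];
        }
        for (int b = 0; b < BATCH; b++)
            out[b * n_out + j] = acc[b];
    }
}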

Pruning works by studying the input data and ignoring values below a threshold. As fully connected layers are placed at the later stages of a CNN network, many possible features have already been eliminated. Therefore, pruning can significantly reduce the amount of work required.
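
Pruning can be modelled as a pre-pass over the layer input, as in the sketch below: the indices of activations whose magnitude exceeds a threshold are collected once, and only those elements cause coefficient fetches and multiplies. The threshold value and the scratch-array interface are illustrative assumptions.

#include <stddef.h>

/* Prune the fully connected layer's input: collect the indices of activations whose
   magnitude exceeds a threshold, then compute the layer using only those elements,
   so only the surviving inputs cause coefficient fetches and multiplies.
   (Illustrative sketch; the threshold is application dependent.) */
static void inner_product_pruned(const float *W, const float *in, float *out,
                                 size_t n_in, size_t n_out, float threshold,
                                 size_t *idx /* scratch, length n_in */)
{
    size_t kept = 0;
    for (size_t i = 0; i < n_in; i++)
        if (in[i] > threshold || in[i] < -threshold)
            idx[kept++] = i;

    for (size_t j = 0; j < n_out; j++) {
        float acc = 0.0f;
        const float *row = &W[j * n_in];
        for (size_t k = 0; k < kept; k++)
            acc += row[idx[k]] * in[idx[k]];
        out[j] = acc;
    }
}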

Resource
The key resource driver of the network is the amount of on-chip M20K memory available to store the outputs of each layer. This is constant and independent of the amount of parallelism achieved. Extending the network over multiple FPGAs increases the total amount of M20K memory available and therefore the depth of the CNN that can be processed.

Conclusion
The unique flexibility of the FPGA fabric allows the logic precision to be adjusted to the minimum that a particular network design requires. By limiting the bit precision of the CNN calculation the number of images that can be processed per second can be significantly increased, improving performance and reducing power.

The non-batched approach of the FPGA implementation allows single-frame latency for object recognition, which is ideal for situations where low latency is crucial, e.g. object avoidance.

Using this approach for AlexNet (single precision for layer 1, then using 16 bit fixed for remaining layers), each image can be processed in ~1.2 milliseconds with a single Arria 10 FPGA, or 0.58 milliseconds with two FPGAs in series.

View All Nallatech FPGA Cards

• FPGA Accelerated Compute Node (FACN) – with up to (4) 520s
• 520 – NEW Compute Accelerator Card with Stratix 10 FPGA
• 510T – Compute Accelerator with (2) Arria 10 FPGAs
• 385A – Network Accelerator with Arria 10 GX1150 FPGA

