Nallatech Whitepaper – FPGA Accelerated CNN
Introduction – CNN – Convolutional Neural Network
Convolutional Neural Networks (CNNs) have been shown to be extremely effective at complex image recognition problems. This white paper discusses how CNN computation can be accelerated using FPGA acceleration products from Nallatech, programmed using the Altera OpenCL Software Development Kit. Image categorization performance can be optimized by adjusting computation precision: reducing the computational precision allows the FPGA accelerator to process more images per second.
Caffe is a deep learning framework made with expression, speed, and modularity in mind. It was developed and is maintained by the Berkeley Vision and Learning Center and by community contributors. http://caffe.berkeleyvision.org/
The Caffe framework uses a plain-text Protocol Buffers (prototxt) interface to describe the different processing layers required for a particular CNN. By implementing different combinations of layers, a user is able to quickly create a new network topology for their given requirements.
The most commonly used of these layers are:
- Convolution: The convolution layer convolves the input image with a set of learnable filters, each producing one feature map in the output.
- Pooling: Max-pooling partitions the input image into a set of non-overlapping rectangles and, for each such sub-region, outputs the maximum.
- Rectified-Linear (ReLU): Given an input value x, the ReLU layer computes the output as x if x > 0 and negative_slope * x if x <= 0.
- InnerProduct/Fully Connected: The image is treated as a single vector, with each point contributing to each point of the new output vector.
By porting these 4 layers to the FPGA, the vast majority of forward processing networks can be implemented on the FPGA using the Caffe framework.
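The four layer types listed above can be sketched in a few lines of plain Python. This is only an illustrative reference model, not the FPGA or Caffe implementation: the function names are hypothetical, images are nested lists, and the convolution is written as the sliding multiply-accumulate over the valid region with no stride or padding options.

```python
def convolve(image, kernel):
    """Slide a k x k filter over a 2D image (valid region only, stride 1)."""
    k = len(kernel)
    rows, cols = len(image), len(image[0])
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(k) for j in range(k))
            for c in range(cols - k + 1)
        ]
        for r in range(rows - k + 1)
    ]


def max_pool(image, size):
    """Output the maximum of each non-overlapping size x size rectangle."""
    rows, cols = len(image), len(image[0])
    return [
        [
            max(image[r + dr][c + dc]
                for dr in range(size) for dc in range(size))
            for c in range(0, cols - size + 1, size)
        ]
        for r in range(0, rows - size + 1, size)
    ]


def relu(x, negative_slope=0.0):
    """Return x if x > 0, otherwise negative_slope * x."""
    return x if x > 0 else negative_slope * x


def inner_product(vector, weights):
    """Fully connected layer: weights[j][i] links input i to output j."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]
```

For example, `max_pool` applied to a 4×4 image with `size=2` yields a 2×2 image holding the maximum of each quadrant.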
Figure 2 : ImageNet CNN – Convolutional Neural Network
ImageNet is a respected and widely used CNN, with freely available trained datasets and benchmarks. This paper discusses an FPGA implementation targeted at the ImageNet CNN; however, the approach used here would apply equally well to other networks.
Figure 2 illustrates the different network layers required by the ImageNet CNN – Convolutional Neural Network. There are 5 convolution and 3 fully connected layers. These layers occupy > 99% of the processing time for this network. There are 3 different filter sizes for the different convolution layers, 11×11, 5×5 and 3×3. Because the computational time of each layer differs depending upon the number of filters applied and the size of the input images, creating different layers optimized for the different convolution layers would be inefficient. To avoid this inefficiency, the smallest filter (3×3) is used as the basis for the larger convolutional blocks. The 3×3 convolutional layers are the most computationally demanding due to the number of input and output features processed by them.
The larger filter sizes can be represented as multiple passes of the smaller 3×3 filters. This adds inefficiency into the kernel processing, but allows for logic reuse between the different layers. The cost of this approach is illustrated in Table 1.
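One way to picture the multi-pass approach is to zero-pad the larger filter up to a multiple of 3 and cut it into 3×3 tiles, accumulating one partial sum per tile. The sketch below assumes exactly that tiling scheme, which the paper does not spell out; the function names are hypothetical. It checks that the tiled result matches the direct one and estimates the fraction of multiplies doing useful work under this assumption.

```python
import math

TILE = 3  # base filter size reused for every layer


def split_into_tiles(kernel):
    """Zero-pad a k x k filter to a multiple of 3 and cut it into 3x3 tiles.

    Returns a list of (row_offset, col_offset, tile) entries.
    """
    k = len(kernel)
    padded = math.ceil(k / TILE) * TILE
    grid = [[kernel[r][c] if r < k and c < k else 0 for c in range(padded)]
            for r in range(padded)]
    tiles = []
    for r0 in range(0, padded, TILE):
        for c0 in range(0, padded, TILE):
            tiles.append((r0, c0, [row[c0:c0 + TILE] for row in grid[r0:r0 + TILE]]))
    return tiles


def window_sum(image, kernel, r, c):
    """Direct k x k multiply-accumulate with the window's top-left at (r, c)."""
    k = len(kernel)
    return sum(image[r + i][c + j] * kernel[i][j]
               for i in range(k) for j in range(k))


def tiled_window_sum(image, kernel, r, c):
    """Same result, accumulated over several 3x3 passes."""
    rows, cols = len(image), len(image[0])
    total = 0
    for r0, c0, tile in split_into_tiles(kernel):
        for i in range(TILE):
            for j in range(TILE):
                y, x = r + r0 + i, c + c0 + j
                # Taps that fall off the image belong to the zero padding.
                if y < rows and x < cols:
                    total += image[y][x] * tile[i][j]
    return total


def efficiency(k):
    """Useful multiplies divided by multiplies issued for a tiled k x k filter."""
    passes = math.ceil(k / TILE) ** 2
    return (k * k) / (passes * TILE * TILE)
```

Under this assumed tiling, a 5×5 filter needs 4 passes (25 useful multiplies out of 36) and an 11×11 filter needs 16 passes (121 out of 144). These figures follow from the sketch's assumptions and are not taken from Table 1.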
Table 1 : Convolution kernel efficiency
The 3×3 convolution kernel can also be used by the fully connected layers.
Table 2 : ImageNet layer computation requirements when using 3×3 filters
FPGA logic areas
FPGA devices have two processing resource types: DSP and ALU logic. The DSP logic is dedicated logic optimized for large (18×18 bit) fixed point multiply or multiply-add operations. This is much more efficient than using ALU logic, where such multiplications are costly. Because multiplications are so common in DSP operations, FPGA vendors provide dedicated logic for this purpose. Altera has gone a step further and allows the DSP logic to be reconfigured to perform floating point operations. To increase the performance of CNN processing, it is necessary to increase the number of multiplications that are implemented in the FPGA. One approach is to decrease the bit accuracy.
Most CNN implementations use floating point precision for the different layer calculations. For a CPU or GPGPU implementation this is not an issue, as the floating point IP is a fixed part of the chip architecture. For FPGAs the logic elements are not fixed. The Arria 10 devices from Altera have embedded floating point DSP blocks that can also be used for fixed point multiplication. Each DSP component can in fact be used for two independent 18×19 bit multiplications. By performing convolution using 18 bit fixed point logic, the number of available operators doubles compared to single precision floating point.
Figure 4 : Arria 10 fixed point DSP configuration
Depending upon the performance requirements of the CNN application, the bit precision can be reduced further still. If the bit width of the multiplications can be reduced to 10 bits or less (giving a 20 bit output), the multiplication can be performed efficiently using just the FPGA ALU logic. This doubles the number of multiplications possible compared to using the FPGA DSP logic alone.
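The relationship between 10 bit operands and a 20 bit product can be illustrated with a small fixed point sketch. The number format here is an assumption made for illustration only (signed 10 bit values with 8 fractional bits, saturating on overflow); the paper does not state which fixed point format the implementation uses, and the function names are hypothetical.

```python
def quantize(value, bits=10, frac_bits=8):
    """Round a real value to a signed fixed point integer, saturating at the limits.

    bits and frac_bits are illustrative assumptions, not values from the paper.
    """
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    scaled = round(value * (1 << frac_bits))
    return max(lo, min(hi, scaled))


def fixed_multiply(a, b):
    """Multiply two quantized operands; two N bit inputs need at most 2N bits."""
    return a * b


def to_float(product, frac_bits=8):
    """Convert the product of two fixed point operands back to a real value."""
    return product / float(1 << (2 * frac_bits))


def fits_in_signed(value, bits):
    """Check that an integer is representable as a signed value of this width."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1
```

With these assumptions, `quantize(0.5)` gives 128 and `quantize(-0.25)` gives -64; their product, -8192, converts back to -0.125. The largest-magnitude product of two signed 10 bit operands, (-512) × (-512), still fits in a signed 20 bit result.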
OpenCL library functions
Altera has provided the ability to include user defined and optimized IP components into their compiler tool flow. This allows such optimized functions to be created and included using standard library notation. The library components allow an experienced HDL programmer to create highly efficient implementations in the same way an assembly language programmer would create and include optimized functions for x86.
For the CNN layers used by ImageNet, it was ascertained that 10 bit coefficient data was the lowest precision that could be used for a simple fixed point implementation while maintaining less than 1% error versus a single precision floating point operation. Therefore an optimized library for a 10 bit 3×3 convolution was created. This library was then replicated as many times as possible, limited by the FPGA resources available.
Figure 5 : Arria 10 GX1150 resources
The largest available Arria 10 device is the GX 1150. This device has resources for ~512 convolution blocks, plus the application control logic.
Increasing the number of parallel convolution kernels increases the input bandwidth requirements. To avoid global memory becoming a bottleneck, multiple images are calculated at once, allowing the convolution filter weights to be reused for each different image. This is particularly important for the fully connected layers, where a new set of filter weights is required for each point to point connection and the speed at which weights are retrieved from global memory is the bottleneck. Fortunately, the convolution layers reuse the weight data for each point in a feature image. The smallest convolution feature image is 13×13 pixels, therefore the convolution weights need only be updated every 169 iterations in the worst case.
Figure 6 : Nallatech 510T Accelerator
The hardware selected for this CNN – Convolutional Neural Network implementation was the Nallatech 510T – a GPU-sized FPGA accelerator card compatible with most server platforms designed to support Intel Xeon Phi or GPGPU accelerators. The Nallatech 510T features two Altera Arria 10 GX 1150 FPGAs with ~60 GBytes/sec external memory bandwidth for loading weights, input and output data. Typical power consumption of the 510T is only 150W – less than half the power consumption of a high-end GPGPU. An added bonus of using 10 bit coefficient data for the FPGA implementation is the tripling in the amount of weight data that can be read from global memory versus floating point data.
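The effect of coefficient width on weight bandwidth can be checked with simple arithmetic. The sketch below assumes the full ~60 GBytes/sec is spent on weights and that 10 bit weights are packed tightly with no padding; both are simplifying assumptions, and the names are hypothetical.

```python
BANDWIDTH_BYTES_PER_S = 60 * 10**9  # ~60 GBytes/sec quoted for the 510T


def weights_per_second(bandwidth_bytes_per_s, bits_per_weight):
    """Number of weights that can be streamed per second at a given width."""
    return bandwidth_bytes_per_s * 8 / bits_per_weight


float_rate = weights_per_second(BANDWIDTH_BYTES_PER_S, 32)  # single precision
fixed_rate = weights_per_second(BANDWIDTH_BYTES_PER_S, 10)  # 10 bit coefficients
ratio = fixed_rate / float_rate
```

Under these assumptions the ratio is 32/10 = 3.2, consistent with the roughly threefold increase described above.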
Using the Nallatech 510T accelerator, 16 parallel images can be processed with each image having 64 kernels processed in parallel. This was achieved by generating 8 output features and 8 pixels per feature in parallel. This gives a total of 1024 parallel 3×3 kernels.
In our implementation we created an OpenCL kernel system for 1 image and replicated this as many times as possible given the FPGA resource constraints. The convolution weights are reused for each image so there is minimal increase to global memory requirements when scaling to multiple parallel images.
By applying the above FPGA system, each image takes 9 milliseconds to be categorized by the FPGA. With 12 parallel images handled by the 510T, this gives an average time of 748 µs per image. This is over 115 million images per day.
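The throughput figures can be reproduced with a short calculation. The sketch below uses only the numbers quoted above; the function names are hypothetical. Dividing the rounded 9 ms latency by 12 gives 750 µs, slightly above the quoted 748 µs, presumably because the 9 ms figure is itself rounded.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_time_per_image(latency_s, parallel_images):
    """Average time per image when several images are in flight at once."""
    return latency_s / parallel_images


def images_per_day(time_per_image_s):
    """Images categorized in 24 hours at a steady per-image time."""
    return SECONDS_PER_DAY / time_per_image_s


avg_s = average_time_per_image(9e-3, 12)  # from the rounded 9 ms latency
daily = images_per_day(748e-6)            # from the quoted 748 µs average
```

Using the quoted 748 µs average, `daily` comes to about 115.5 million images.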
Figure 7 : Images categorized per second. Nallatech 510T versus Nvidia K40 [1]
The Nvidia K40 GPGPU has a nominal power consumption of 235 Watts compared to the 150W of the Nallatech 510T. This gives the FPGA implementation a significant performance/power advantage versus the GPGPU.
Figure 8 : Relative Image/Power. Nallatech 510T versus Nvidia K40 [1]
[1] Caffe implementation of ImageNet. http://caffe.berkeleyvision.org
The unique flexibility of FPGA fabric allows the logic precision to be adjusted to the minimum that a particular network design requires. By limiting the bit precision of the CNN – Convolutional Neural Network calculation the number of images that can be processed per second can be significantly increased, improving performance and reducing power.
The non-batching approach of an FPGA implementation allows for object recognition in 9 milliseconds (a single frame period), ideal for situations where low latency is crucial, e.g. object avoidance. This permits images to be categorized at a frame rate greater than 100 Hz.
The intrinsic scalability demonstrated by our FPGA implementation can be utilized to implement complex CNNs on increasingly smaller and lower power FPGAs at the expense of some performance. This allows less demanding applications to be implemented on extremely low power FPGA devices, which is particularly useful for embedded solutions, e.g. near sensor computing.
Figure 9 : Miniaturized packaging module for near sensor processing (FPGA, memory and support circuitry)
By packaging FPGAs with sensor hardware it is possible to use the power of CNN – Convolutional Neural Network image recognition near the sensor ensuring low latency is maintained and optimizing bandwidth between the sensor and the host.