Altera’s launch of OpenCL support for FPGA systems has ushered in a new era in high performance computing using CPUs and FPGAs in a hybrid computing model. Altera’s OpenCL Compiler (ACL) support for FPGA cards:
• Gives programmers easy access to the power of FPGA computing.
• Offers significantly higher performance at much lower power than is available using other technologies.
• Provides a significant time-to-market advantage compared to traditional FPGA development using hardware description languages.
• Automatically abstracts the details of hardware design away from designers.
Implementing FPGA designs with the OpenCL compiler allows a designer to easily offload parts of their algorithm to the FPGA to increase performance, lower power and improve productivity.
This parallel programming methodology uses a kernel approach, where data is passed to a specified kernel for processing. The kernel code is written in C with a minimal set of extensions, which allows parts of the application code or its subroutines to take advantage of parallel performance by running on the FPGA.
This application note illustrates how to perform AES encryption on FPGAs using the OpenCL tool flow.
Advanced Encryption Standard (AES)
The Advanced Encryption Standard (AES) is a symmetric-key encryption standard that has been adopted by the U.S. government. AES supports key sizes of 128, 192 and 256 bits, all of which operate on data in 128-bit blocks. The AES algorithm consists of multiple bit shifts and Exclusive OR (XOR) operations, which make it an ideal candidate for acceleration on FPGAs.
AES operates on a 4×4 array of bytes, termed the state (Rijndael variants with a larger block size have additional columns in the state). AES consists of four distinct processing stages, as listed below:
1. Key Expansion – The round keys are derived from the cipher key using Rijndael’s key schedule.
2. Initial Round
a. Add Round Key: Each byte of the state is combined with the round key using a bitwise XOR.
3. Rounds
a. Sub Bytes: A non-linear substitution step where each byte is replaced with another using a lookup table.
b. Shift Rows: A transposition step where each row of the state is shifted cyclically a certain number of steps.
c. Mix Columns: A mixing operation which operates on the columns of the state, combining the four bytes in each column.
d. Add Round Key
4. Final Round
a. Sub Bytes
b. Shift Rows
c. Add Round Key
In this implementation the host processor performs the key expansion, and the resulting round keys are passed to the AES encryption algorithm on the FPGA. The key schedule only needs to be recomputed when the session key changes, which happens infrequently, so this approach has no significant performance impact.
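The stages listed above, together with host-side key expansion, can be sketched as a software reference model. The following is a minimal Python sketch for AES-128 only, not the OpenCL kernel described in this note. All function names (`gmul`, `build_sbox`, `expand_key`, `encrypt_block`) are illustrative assumptions, and the Sub Bytes table is generated at start-up rather than stored as a constant.

```python
def gmul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B
        b >>= 1
    return result

def build_sbox():
    """Build the Sub Bytes table: field inverse followed by the affine transform."""
    sbox = []
    for x in range(256):
        inv = 0 if x == 0 else next(y for y in range(1, 256) if gmul(x, y) == 1)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox

SBOX = build_sbox()

def expand_key(key):
    """Key Expansion (host side): 16-byte key -> 11 round keys of 16 bytes each."""
    words = [list(key[i:i + 4]) for i in range(0, 16, 4)]
    rcon = 1
    for i in range(4, 44):
        temp = list(words[i - 1])
        if i % 4 == 0:
            temp = temp[1:] + temp[:1]          # rotate the word by one byte
            temp = [SBOX[b] for b in temp]      # substitute each byte
            temp[0] ^= rcon                     # add the round constant
            rcon = gmul(rcon, 2)
        words.append([a ^ b for a, b in zip(words[i - 4], temp)])
    return [bytes(sum(words[4 * r:4 * r + 4], [])) for r in range(11)]

# The state is a flat list of 16 bytes in column-major order:
# index = row + 4 * column.

def add_round_key(state, round_key):
    return [s ^ k for s, k in zip(state, round_key)]

def sub_bytes(state):
    return [SBOX[b] for b in state]

def shift_rows(state):
    # Row r is rotated left by r positions.
    return [state[r + 4 * ((c + r) % 4)] for c in range(4) for r in range(4)]

def mix_columns(state):
    out = []
    for c in range(4):
        a = state[4 * c:4 * c + 4]
        for r in range(4):
            out.append(gmul(a[r], 2) ^ gmul(a[(r + 1) % 4], 3)
                       ^ a[(r + 2) % 4] ^ a[(r + 3) % 4])
    return out

def encrypt_block(block, round_keys):
    """Encrypt one 16-byte block using round keys precomputed by expand_key."""
    state = add_round_key(list(block), round_keys[0])      # initial round
    for round_key in round_keys[1:-1]:                     # nine main rounds
        state = mix_columns(shift_rows(sub_bytes(state)))
        state = add_round_key(state, round_key)
    state = shift_rows(sub_bytes(state))                   # final round, no Mix Columns
    return bytes(add_round_key(state, round_keys[-1]))
```

Because `encrypt_block` takes the round keys as an argument, the split mirrors the one described above: `expand_key` runs once per key on the host, and only the per-block rounds would need to run on the accelerator.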
ECB and CTR Ciphers
Electronic Codebook (ECB) is the simplest cipher mode to program on an FPGA. Because the output of one block has no effect on the result of the next, the cipher can easily be replicated multiple times and pipelined. ECB can be seen in Figure 1.
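The block independence of ECB can be illustrated with a short Python sketch. The block cipher here is passed in as a function; `toy_block_cipher` is a hypothetical, insecure stand-in used only so the example runs on its own, and is not AES.

```python
BLOCK_SIZE = 16  # AES block size in bytes

def ecb_encrypt(data, encrypt_block):
    """Encrypt each 16-byte block independently with the supplied block cipher."""
    if len(data) % BLOCK_SIZE:
        raise ValueError("data length must be a multiple of 16 bytes")
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    # No block depends on any other, so this loop could run in any order,
    # or in parallel across replicated cipher pipelines.
    return b"".join(encrypt_block(block) for block in blocks)

def toy_block_cipher(block):
    """Hypothetical stand-in for a real block cipher (NOT secure)."""
    return bytes(b ^ 0x5A for b in reversed(block))
```

One consequence of this independence is that identical plaintext blocks always produce identical ciphertext blocks, which is why ECB is easy to parallelize but reveals repetition in the input.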
2D component matching, also known as blob extraction or region extraction, is commonly used in computer vision to detect connected regions that meet pre-determined criteria, such as a threshold value. The technique can also be extended to volumes. Use cases include medical imaging volume analysis (e.g. MRI results), core porosity analysis (e.g. oil and gas) and many other connectivity analysis problems.
A technique for 2D component labeling is presented here, with a follow-on section describing how this can be extended to 3D volumes. This paper shows how it is possible to dramatically accelerate 3D component matching on an energy-efficient FPGA-based platform using OpenCL – the open standard for parallel programming. For 2D component matching several algorithms are commonly cited; the following are two examples…
One component at a time
The 2D image is scanned until a pixel meets the required criteria. The pixel’s neighbors are then analyzed and a linked list is created of the connected neighbors. This process is repeated recursively until no more connected neighbors are found. All pixels that were part of a connected linked list are assigned the same index. The index is then incremented and the next unconnected point on the image is analyzed. The process continues until the entire image is scanned. This technique can easily be adapted to three dimensions for 3D component matching.
The random traversal through memory required for this approach places the performance bottleneck on system memory bandwidth.
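The one-component-at-a-time approach can be sketched in Python as follows. This is a minimal illustration under stated assumptions: a pixel qualifies when its value is at or above `threshold`, neighbors are 4-connected, and an explicit list of pending pixels replaces the linked list and recursion described above. The function name `label_components` is hypothetical.

```python
def label_components(image, threshold):
    """Label 4-connected regions of pixels >= threshold; 0 marks background."""
    height, width = len(image), len(image[0])
    labels = [[0] * width for _ in range(height)]
    next_label = 1
    for y in range(height):
        for x in range(width):
            if image[y][x] < threshold or labels[y][x]:
                continue  # background, or already part of a component
            # Grow one component from this seed pixel.
            labels[y][x] = next_label
            pending = [(y, x)]
            while pending:
                cy, cx = pending.pop()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and image[ny][nx] >= threshold
                            and not labels[ny][nx]):
                        labels[ny][nx] = next_label
                        pending.append((ny, nx))
            next_label += 1
    return labels
```

The order in which `pending` is drained depends on the shape of each component, which is the data-dependent memory traversal noted above.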
In the two-pass algorithm, the image is scanned linearly from the top left corner to the bottom right corner. Each qualifying pixel is given an ID equal to the minimum ID among its already-labeled neighbors. If no labeled neighbor exists, the ID value is incremented and the pixel is set to this new value.
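A minimal Python sketch of a two-pass labeling scheme follows. The text above describes only the first scan; the equivalence table (a simple union-find) and the second pass that resolves it are assumptions taken from the commonly cited form of the algorithm, as are the 4-connectivity, the `threshold` test, and the function name `two_pass_label`.

```python
def two_pass_label(image, threshold):
    """Two-pass 4-connected labeling; 0 marks background."""
    height, width = len(image), len(image[0])
    labels = [[0] * width for _ in range(height)]
    parent = [0]  # parent[i] is the union-find parent of provisional ID i

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # shorten the chain as we walk it
            i = parent[i]
        return i

    # Pass 1: assign provisional IDs and record which IDs touch each other.
    for y in range(height):
        for x in range(width):
            if image[y][x] < threshold:
                continue
            neighbors = []
            if y > 0 and labels[y - 1][x]:
                neighbors.append(labels[y - 1][x])
            if x > 0 and labels[y][x - 1]:
                neighbors.append(labels[y][x - 1])
            if not neighbors:
                parent.append(len(parent))       # new ID, its own root
                labels[y][x] = len(parent) - 1
            else:
                smallest = min(neighbors)
                labels[y][x] = smallest
                for n in neighbors:
                    a, b = find(smallest), find(n)
                    parent[max(a, b)] = min(a, b)  # merge equivalent IDs

    # Pass 2: replace each provisional ID with a compact final ID.
    final = {}
    for y in range(height):
        for x in range(width):
            if labels[y][x]:
                root = find(labels[y][x])
                labels[y][x] = final.setdefault(root, len(final) + 1)
    return labels
```

Both passes read the image in raster order, so memory access is sequential, in contrast to the data-dependent traversal of the one-component-at-a-time approach.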
The Lattice Boltzmann Method (LBM) is a technique for simulating the movement of complex fluid systems. Fluid systems are used in many industries to transmit signals and power using a network of tanks, pipes, valves, pumps and other flow devices. Examples of applications include industrial processing, vehicular control and medical appliances. It is important that companies using fluid systems in this way have a systematic method of mathematically modelling different types of fluid systems for safe and reliable operation. This can be achieved, but typically at significant computational cost. Computational Fluid Dynamics (CFD) is one of the most resource-demanding branches of high-performance computing (HPC), and there is constant demand for cheaper, faster CFD computing platforms. This paper describes how it is possible to dramatically accelerate the LBM technique on an energy-efficient FPGA-based platform using OpenCL – the open standard for parallel programming.
Traditional CFD methods solve the conservation equations for mass, energy, etc., whereas the LBM model uses particles to propagate these quantities. To simulate every particle in a system would be impossible, hence the LBM technique uses particle densities confined to a discrete lattice to simulate particle interactions.
The LBM technique is split into two stages: collision and streaming. The collision stage relaxes the particle distributions at each lattice site towards a local equilibrium, and the streaming stage then propagates them to neighboring sites. There are various techniques for finding an equilibrium, some more accurate than others. The operator used here is the Bhatnagar-Gross-Krook (BGK) operator.
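The two stages can be sketched in Python as follows. This is a minimal illustration, not the FPGA implementation: it assumes a two-dimensional, nine-velocity (D2Q9) lattice, the standard second-order equilibrium distribution, periodic boundaries and a single relaxation time `tau`, none of which are specified above. The names `equilibrium`, `collide` and `stream` are hypothetical.

```python
# D2Q9 lattice (an assumption): rest, four axis and four diagonal velocities.
E = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
     (1, 1), (-1, 1), (-1, -1), (1, -1)]
W = [4 / 9] + [1 / 9] * 4 + [1 / 36] * 4

def equilibrium(rho, ux, uy):
    """Equilibrium distribution for density rho and velocity (ux, uy)."""
    usq = ux * ux + uy * uy
    feq = []
    for (ex, ey), w in zip(E, W):
        eu = ex * ux + ey * uy
        feq.append(w * rho * (1 + 3 * eu + 4.5 * eu * eu - 1.5 * usq))
    return feq

def collide(f, tau):
    """BGK collision at one site: relax the nine densities towards equilibrium."""
    rho = sum(f)
    ux = sum(fi * ex for fi, (ex, _) in zip(f, E)) / rho
    uy = sum(fi * ey for fi, (_, ey) in zip(f, E)) / rho
    feq = equilibrium(rho, ux, uy)
    return [fi - (fi - fe) / tau for fi, fe in zip(f, feq)]

def stream(grid):
    """Move each density one site along its velocity (periodic boundaries)."""
    height, width = len(grid), len(grid[0])
    moved = [[[0.0] * 9 for _ in range(width)] for _ in range(height)]
    for y in range(height):
        for x in range(width):
            for i, (ex, ey) in enumerate(E):
                moved[(y + ey) % height][(x + ex) % width][i] = grid[y][x][i]
    return moved
```

One time step would apply `collide` to every site and then call `stream` on the whole grid. The collision at each site uses only that site's nine values, so every site can be processed independently.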