FPGA Acceleration of Binary Neural Networks

Deep Learning

Until only a decade ago, Artificial Intelligence resided almost exclusively within the realm of academia, research institutes and science fiction. The relatively recent realization that Deep Learning techniques could be applied practically and economically, at scale, to solve real-world application problems has resulted in a vibrant ecosystem of market players.

Now, almost every application area is in some way benefiting from Deep Learning – the leveraging of Artificial Neural Networks to learn from vast volumes of data to efficiently execute specific functions. From this field of neural network research and innovation, Convolutional Neural Networks (CNNs) have emerged as a popular deep learning technique for solving image classification and object recognition problems. CNNs exploit spatial correlations within image sets by using convolution operations. CNNs are generally regarded as the neural network of choice, especially for low-power applications, because they have fewer weights and are easier to train than fully connected networks, which demand more resources.

Neural Networks

One approach to reducing the silicon area, and therefore the power, required to execute a high-performance neural network is to reduce the dynamic range of the floating-point calculations. Using 16-bit floating-point arithmetic instead of 32-bit has been shown to only slightly impact the accuracy of image classification. Furthermore, depending upon the network, the precision of the calculation can be reduced even further, to fixed point or even single bits. This trend of improving overall efficiency through reduced calculation precision has led to the use of binary weights, i.e. weights and input activations that are binarized to only two values: +1 and -1. This new variant is known as a Binary Neural Network (BNN). It reduces all fixed-point multiplication operations in the convolutional and fully connected layers to 1-bit XNOR operations.

Flexible FPGAs

Established classes of conventional computing technologies have attempted to evolve at pace to cater for this dynamic market. NVIDIA, for instance, has adapted not only the underlying GPU architecture and tools, but also its product strategy and value proposition. GP-GPUs, previously marketed as the ultimate double-precision floating-point engines for graphics and demanding HPC applications, are now being re-positioned for the Deep Learning CNN market, where half-precision arithmetic support is critical for success.

Google, one of the strongest proponents of AI, has created its own dedicated hardware architecture, the Tensor Processing Unit (TPU), which is tightly coupled with their Machine Learning framework, TensorFlow. Other industry leaders, including hyperscale innovator Microsoft, have selected Field Programmable Gate Arrays (FPGAs) for their “Brainwave” AI architecture – a pipeline of persistent neural networks that promises to deliver real-time results. This choice is no doubt linked to the confidence they gained from the highly successful (and market disrupting) use of Intel-based Arria-10 FPGAs for Bing search indexing.

This white paper explains why FPGAs are uniquely positioned to address the dynamic roadmap requirements of neural networks of all bit ranges – in particular, BNNs.

Binary Neural Networks

Processing convolutions within CNN networks requires many millions of coefficients to be stored and processed. Traditionally, each of these coefficients is stored in a full single-precision representation. Research has demonstrated that coefficients can be reduced to half precision without any material change to the overall accuracy, while reducing storage capacity and memory bandwidth. More significantly, this approach also shortens the training and inference time. Most of the pre-trained CNN models available today use partially reduced precision.

Figure 1 : Converting weights to binary (mean = 0.12)

By using a different approach to the training of these coefficients, the bit accuracy can be reduced to a single bit plus a scaling factor [1]. During training, the floating-point coefficients are converted to binarized values and a scaling factor: the coefficients of each output feature are averaged, and this average is subtracted from each original value to produce a result that is either positive or negative, represented as 1 or 0 in binary notation (Figure 1). The output of the convolution is then multiplied by the mean.
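As a concrete illustration of this step, the following C sketch binarizes the weights of one output feature in the manner described above. It is not taken from the whitepaper's implementation; the function name, bit packing and layout are illustrative only.

```c
#include <stddef.h>
#include <stdint.h>

/* Binarize the weights of one output feature as described above:
 * the mean of the coefficients becomes the scaling factor, and each
 * weight is kept as a single bit recording whether it sits above or
 * below that mean.  Names and packing are illustrative only. */
static float binarize_feature(const float *weights, size_t n, uint8_t *bits)
{
    float mean = 0.0f;
    for (size_t i = 0; i < n; ++i)
        mean += weights[i];
    mean /= (float)n;

    for (size_t i = 0; i < n; ++i) {
        uint8_t b = (weights[i] - mean) >= 0.0f ? 1u : 0u;  /* positive -> 1, negative -> 0 */
        size_t byte = i / 8, shift = i % 8;
        bits[byte] = (uint8_t)((bits[byte] & ~(1u << shift)) | (b << shift));
    }
    return mean;   /* the convolution output is later multiplied by this value */
}
```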

FPGA Optimizations

Firstly, binarization of the weights reduces the external memory bandwidth and storage requirements by a factor of 32. The FPGA fabric can take advantage of this binarization as each internal memory block can be configured to have a port width ranging from 1 to 32 bits. Hence, the internal FPGA resource for storage of weights is significantly reduced, providing more space for parallelization of tasks.

The binarization of the network also allows the CNN convolutions to be represented as a series of additions or subtractions of the input activations. If the weight is binary 0, the input is subtracted from the result; if the weight is binary 1, it is added to the result. Each logic element in an FPGA has addition carry-chain logic that can efficiently perform integer additions of virtually any bit length. Utilizing these components efficiently allows a single FPGA device to perform tens of thousands of parallel additions. To do so, the floating-point input activations must be converted to fixed precision. Given the flexibility of the FPGA fabric, we can tune the number of bits used by the fixed-point additions to meet the requirements of the CNN. Analysis of the dynamic range of activations in various CNNs shows that only a handful of bits, typically 8, are required to maintain accuracy to within 1% of a floating-point equivalent design. The number of bits can be increased if more accuracy is required.
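To make the add/subtract formulation concrete, here is a minimal C sketch of a binary-weight dot product over 8-bit fixed-point activations. It is illustrative only (not the Nallatech HDL or OpenCL kernel): each 1-bit weight selects whether the activation is added or subtracted, and the final sum is rescaled by the per-feature mean recovered during binarization.

```c
#include <stddef.h>
#include <stdint.h>

/* Binary-weight "convolution" reduced to additions/subtractions.
 * act   : input activations converted to 8-bit fixed point
 * wbits : packed 1-bit weights (1 = add, 0 = subtract)
 * n     : number of activations/weights
 * scale : per-output-feature scaling factor (the mean from binarization)
 * The accumulator is a plain integer adder chain - exactly the kind of
 * logic an FPGA carry chain implements cheaply. */
static float binary_dot(const int8_t *act, const uint8_t *wbits,
                        size_t n, float scale)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        int bit = (wbits[i / 8] >> (i % 8)) & 1;
        acc += bit ? act[i] : -act[i];
    }
    return scale * (float)acc;   /* rescale back to the network's range */
}
```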

Converting to fixed point for the convolution and removing the need for multiplications via binarization dramatically reduces the logic resources required within the FPGA. It is then possible to perform significantly more processing in the same FPGA compared to a single-precision or half-precision implementation.

Deep Learning models are becoming deeper by adding more and more convolution layers. Having the capability to stack all these layers into a single FPGA device is critical to achieving the best performance per watt for a given cost while retaining the lowest possible latency.

FPGA Implementation

The Intel FPGA OpenCL framework was used to create the CNNs described in this paper. To optimize the design further, the Nallatech research center developed IP libraries for the binary convolution and other bit manipulation operations. This provides a powerful mix of programmability and efficiency.

Table 1 : Approximate Yolo V3 layers

The network targeted for this white paper was the Yolo v3 network (Table 1). This network consists largely of convolution layers and therefore the FPGA has been optimized to be as efficient at convolutions as possible.

To achieve this, the design uses an HDL block of code to perform the integer accumulations required for binary networks, making for an extremely efficient implementation.

Table 2 : Resource requirements of BNN IP (% Arria 10 GX 1150)

Table 2 lists the resource requirements for the accumulation of the 8-bit activation data when using binary weights. This is equivalent to 2048 floating-point operations, yet requires only 2% of the device. Note that extra FPGA resource is required to restructure the data so it can be processed this way (see Table 3); nevertheless, this illustrates the dramatic reduction in resources that can be achieved versus a floating-point implementation.

The FPGA is also required to process the other layers of Yolo v3 to minimize the data copied over the PCIe interface. These layers require much less processing and therefore less of the FPGA resource is allocated to these tasks. In order for the network to train correctly, it was necessary for activation layers to be processed with single precision accuracy. Therefore, all layers other than the convolution are calculated at single precision accuracy.

The final convolution layer is also calculated in single precision to improve training and is processed on the host CPU. Table 3 details the resources required by the OpenCL kernels including all conversions from float to 8-bit inputs, the scaling of the output data and final floating-point accumulation.

Table 3 : Resource requirements for full Yolo v3 CNN kernel (% Arria 10 GX 1150)

FPGA Accelerator Platforms

The FPGA device targeted in this whitepaper is an Intel-based Arria-10. It is a mid-range FPGA fully supported within the Intel OpenCL Software Development Kit (SDK). Nallatech delivers this flexible, energy-efficient accelerator in the form of either an add-in PCIe card or integrated rackmount server. Applications developed in OpenCL are mapped onto the FPGA fabric using Nallatech’s Board Support Package (BSP) enabling customers (predominantly software rather than hardware focused) to remain at a higher level of abstraction than is typically the case with FPGA technology.

Nallatech’s flagship “520” accelerator card, shown below, features Intel’s new Stratix-10 FPGA. It is a PCIe add-in card compatible with server platforms supporting GPU-class accelerators, making it ideal for scaling Deep Learning platforms cost-effectively.

Performance

Each convolution block performs 2048 operations per clock cycle, or ~0.5 TOPS for a typical Arria 10 device. Four such kernels allow Yolo v3 to run at a frame rate of ~8 frames per second for a power consumption of 35 Watts; that is, 4 × 0.5 TOPS ÷ 35 W ≈ 57 GOPS/Watt.

XNOR Networks

It is possible to further reduce compute and storage requirements of CNNs by moving to a full XNOR network. Here both the weights and activations are represented as binary inputs. In this case a convolution is represented as a simple bitwise XNOR calculation, plus some bit counting logic. This is equivalent to the binary version described earlier except that activations are now only a single bit wide.
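A sketch of the XNOR formulation in C follows (illustrative; the bit packing and the scaling conventions are assumptions, not taken from the whitepaper): with both activations and weights reduced to single bits, a 32-element partial convolution collapses to one XNOR and one population count.

```c
#include <stddef.h>
#include <stdint.h>

/* XNOR "convolution" over packed 1-bit activations and weights.
 * Each matching bit pair contributes +1 and each mismatch -1, so the
 * dot product is 2*popcount(~(a ^ w)) - nbits.  Assumes nbits is a
 * multiple of 32 (no padding bits).  The result would then be scaled
 * by the factors recovered during training. */
static int xnor_dot(const uint32_t *a, const uint32_t *w,
                    size_t nwords, int nbits)
{
    int pop = 0;
    for (size_t i = 0; i < nwords; ++i)
        pop += __builtin_popcount(~(a[i] ^ w[i]));  /* GCC/Clang builtin: XNOR + bit count */
    return 2 * pop - nbits;
}
```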

Speed-up of such networks is estimated at two orders of magnitude when running on an FPGA. This disruptive performance improvement enables multiple real-time inferences to run in parallel on power-efficient devices. XNOR networks require a different approach to training, where activations on the forward pass are converted to binary values plus a scaling factor.

Whereas binary networks show little degradation in accuracy, XNOR networks show a 10-20% [2] difference to a floating-point equivalent. However, this is using CNNs not designed specifically for XNOR calculations. As research into this area increases, it is likely the industry will see new models designed with XNOR networks in mind that provide a level of accuracy close to the best CNNs, while benefiting from the tremendous efficiency of this new approach.

Conclusion

This whitepaper has demonstrated that significant bit reductions can be achieved without adversely impacting the quality of application results. Binary Neural Networks (BNNs), a natural fit for the properties of the FPGA, can be up to thirty times smaller than classic CNNs – delivering a range of benefits including reductions in silicon usage, memory bandwidth, power consumption and clock speed.

Given their recognized strength for efficiently implementing fixed point computations, FPGAs are uniquely positioned to address the needs of BNNs. The inherent architecture flexibility of the FPGA empowers Deep Learning innovators and offers a fast-track deployment option for any new disruptive techniques that emerge. XNOR networks are predicted to deliver major improvements in image recognition for a range of cloud, edge and embedded applications.

Nallatech, a Molex company, has over 25 years of FPGA expertise and is recognized as the market leader in FPGA platforms and tools. Nallatech’s complimentary design services allow customers to successfully port, optimize, benchmark and deploy FPGA-based Deep Learning solutions cost-effectively and with minimal risk.

Please visit www.nallatech.com or email contact@nallatech.com for further information.

This work has been partly developed as part of the OPERA project to provide offloading support for low powered traffic monitoring systems: www.operaproject.eu


FPGA Acceleration of Convolutional Neural Networks

Introduction – CNN – Convolutional Neural Network

Convolutional Neural Networks (CNNs) have been shown to be extremely effective at complex image recognition problems. This white paper discusses how these networks can be accelerated using FPGA accelerator products from Nallatech, programmed using the Altera OpenCL Software Development Kit. This paper then describes how image categorization performance can be significantly improved by reducing computation precision. Each reduction in precision allows the FPGA accelerator to process more images per second.

Caffe Integration

Caffe is a deep learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center and by community contributors.

The Caffe framework uses a protobuf-based text description (prototxt) to define the different processing layers required for a particular CNN. By implementing different combinations of layers a user is able to quickly create a new network topology for their given requirements.

The most commonly used of these layers are:
• Convolution: The convolution layer convolves the input image with a set of learnable filters, each producing one feature map in the output image.
• Pooling: Max-pooling partitions the input image into a set of non-overlapping rectangles and, for each such sub-region, outputs the maximum value.
• Rectified-Linear: Given an input value x, the ReLU layer computes the output as x if x > 0 and negative_slope * x if x <= 0.
• InnerProduct/Fully Connected: The image is treated as a single vector, with each point contributing to each point of the new output vector.

By porting these 4 layers to the FPGA, the vast majority of forward processing networks can be implemented on the FPGA using the Caffe framework.
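For reference, here is a minimal C sketch of two of the layer computations defined above (the ReLU with negative_slope and max-pooling over non-overlapping windows). This is plain scalar reference code for illustration, not the FPGA kernel implementation.

```c
#include <stddef.h>

/* Rectified-Linear: x if x > 0, otherwise negative_slope * x. */
static float relu(float x, float negative_slope)
{
    return x > 0.0f ? x : negative_slope * x;
}

/* Max-pooling over non-overlapping pool x pool windows of a
 * width x height feature map (width and height assumed divisible
 * by pool for brevity). */
static void max_pool(const float *in, float *out,
                     size_t width, size_t height, size_t pool)
{
    for (size_t oy = 0; oy < height / pool; ++oy)
        for (size_t ox = 0; ox < width / pool; ++ox) {
            float m = in[(oy * pool) * width + (ox * pool)];
            for (size_t ky = 0; ky < pool; ++ky)
                for (size_t kx = 0; kx < pool; ++kx) {
                    float v = in[(oy * pool + ky) * width + (ox * pool + kx)];
                    if (v > m) m = v;
                }
            out[oy * (width / pool) + ox] = m;
        }
}
```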

Figure 1 : Example illustration of a typical CNN – Convolutional Neural Network

To access the accelerated FPGA version of the code, the user need only change the description of the CNN layer in the Caffe network description file to target the FPGA equivalent.

AlexNet

Figure 2 : AlexNet CNN – Convolutional Neural Network

AlexNet is a well-known and widely used network, with freely available trained datasets and benchmarks. This paper discusses an FPGA implementation targeted at the AlexNet CNN; however, the approach used here would apply equally well to other networks.

Figure 2 illustrates the different network layers required by the AlexNet CNN. There are 5 convolution and 3 fully connected layers. These layers occupy more than 99% of the processing time for this network. There are 3 different filter sizes for the different convolution layers: 11×11, 5×5 and 3×3. Creating separate hardware optimized for each convolution layer would be inefficient, because the computational time of each layer differs depending upon the number of filters applied, the size of the input images and the number of input and output features processed. By increasing the resource applied to the more compute-intensive layers, each layer can be balanced to complete in the same amount of time. It is therefore possible to create a pipelined process with several images in flight at any one time, maximizing the efficiency of the logic used, i.e. most processing elements are busy most of the time.

Table 1 : ImageNet layer computation requirements

Table 1 shows the computation required for each layer of the ImageNet (AlexNet) network. From this table it can be seen that the 5×5 convolution layer requires more compute than the other layers. Therefore, more of the FPGA’s processing logic will be required for this layer so that it can be balanced with the other layers.

The inner product layers have an n-to-n mapping, requiring a unique coefficient for each multiply-add. Inner product layers usually require significantly less compute than convolutional layers and therefore require less parallelization of logic. In this scenario it makes sense to move the inner product layers onto the host CPU, leaving the FPGA to focus on convolutions.

FPGA logic areas

FPGA devices have two processing regions: DSP and ALU logic. The DSP logic is dedicated to multiply or multiply-add operators, because implementing large (18×18-bit) floating-point multiplications in ALU logic is costly. Given how common multiplications are in DSP operations, FPGA vendors provide dedicated logic for this purpose. Altera has gone a step further and allows the DSP logic to be reconfigured to perform floating-point operations. To increase the performance of CNN processing it is necessary to increase the number of multiplications that can be implemented in the FPGA. One approach is to decrease the bit accuracy.

Bit Accuracy

Most CNN implementations use floating-point precision for the different layer calculations. For a CPU or GPGPU implementation this is not an issue, as the floating-point units are a fixed part of the chip architecture. For FPGAs the logic elements are not fixed. The Arria 10 and Stratix 10 devices from Altera have embedded floating-point DSP blocks that can also be used for fixed-point multiplications. Each DSP component can in fact be used as two separate 18×19-bit multipliers. By performing convolution using 18-bit fixed-point logic, the number of available operators doubles compared to single-precision floating point.

Figure 3 : Arria 10 floating point DSP configuration

Figure 3 : Arria 10 floating point DSP configuration

If reduced-precision floating-point processing is required, it is possible to use half precision. This requires additional logic from the FPGA fabric, but doubles the number of floating-point calculations possible, assuming the lower bit precision is still adequate.

One of the key advantages of the pipeline approach described in this white paper is the ability to vary accuracy at different stages of the pipeline. Therefore, resources are only used where necessary, increasing the efficiency of the design.

Figure 4 : Arria 10 fixed point DSP configuration

Depending upon the CNN application’s tolerance, the bit precision can be reduced further still. If the bit width of the multiplications can be reduced to 10 bits or less (20-bit output), the multiplication can be performed efficiently using just the FPGA ALU logic. This doubles the number of multiplications possible compared to using only the FPGA DSP logic. Some networks may be tolerant to even lower bit precision. The FPGA can handle all precisions down to a single bit if necessary.

For the CNN layers used by AlexNet it was ascertained that 10-bit coefficient data was the smallest width that could be used in a simple fixed-point implementation whilst maintaining less than a 1% error versus a single-precision floating-point operation.
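As a hedged illustration of this kind of reduction (the exact scaling scheme used in the whitepaper is not specified, so a simple power-of-two scale is assumed here), a float coefficient can be mapped to a signed 10-bit fixed-point value and clamped to the representable range:

```c
#include <stdint.h>
#include <math.h>

/* Quantize a float coefficient to a signed 10-bit fixed-point value
 * ([-512, 511]) using 'frac_bits' fractional bits.  The scaling scheme
 * is an assumption for illustration; the whitepaper only states that
 * 10-bit coefficients kept the error below 1% of single precision. */
static int16_t quantize_q10(float x, int frac_bits)
{
    long v = lroundf(x * (float)(1 << frac_bits));
    if (v >  511) v =  511;
    if (v < -512) v = -512;
    return (int16_t)v;
}

/* Convert back to float, e.g. to measure the quantization error. */
static float dequantize_q10(int16_t q, int frac_bits)
{
    return (float)q / (float)(1 << frac_bits);
}
```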

CNN convolution layers

Using a sliding window technique, it is possible to create convolution kernels that are extremely light on memory bandwidth.

Figure 5 : Sliding window for 3×3 convolution

Figure 5 illustrates how data is cached in FPGA memory allowing each pixel to be reused multiple times. The amount of data reuse is proportional to the size of the convolution kernel.
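The caching in Figure 5 can be sketched as a simplified C model of the line buffers (not the actual OpenCL kernel): two previous image rows plus the last few pixels of the current row are held on chip, so each pixel is fetched from external memory once but reused by every 3×3 window that overlaps it.

```c
#include <stddef.h>

#define IMG_W 224   /* maximum image width for the line buffers (illustrative) */

/* Simplified model of a 3x3 sliding window: pixels arrive in raster order,
 * two line buffers hold the previous rows, and small shift registers hold
 * the last three pixels of each window row.  Every pixel is read from
 * (global) memory exactly once but reused up to nine times.
 * Assumes width <= IMG_W. */
static void sliding_window_3x3(const float *in, size_t width, size_t height,
                               void (*emit)(const float win[3][3],
                                            size_t cx, size_t cy))
{
    float line0[IMG_W] = {0}, line1[IMG_W] = {0};   /* rows y-2 and y-1 */
    float top[3] = {0}, mid[3] = {0}, cur[3] = {0}; /* last 3 pixels of each row */

    for (size_t y = 0; y < height; ++y) {
        for (size_t x = 0; x < width; ++x) {
            float above2 = line0[x];          /* pixel (x, y-2) from buffer */
            float above1 = line1[x];          /* pixel (x, y-1) from buffer */
            float px = in[y * width + x];     /* the only external read     */

            top[0] = top[1]; top[1] = top[2]; top[2] = above2;
            mid[0] = mid[1]; mid[1] = mid[2]; mid[2] = above1;
            cur[0] = cur[1]; cur[1] = cur[2]; cur[2] = px;

            if (y >= 2 && x >= 2) {
                const float win[3][3] = {
                    { top[0], top[1], top[2] },
                    { mid[0], mid[1], mid[2] },
                    { cur[0], cur[1], cur[2] },
                };
                emit(win, x - 1, y - 1);      /* window centred at (x-1, y-1) */
            }
            line0[x] = above1;                /* rotate the line buffers */
            line1[x] = px;
        }
    }
}
```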

As each input layer influences all output layers in a CNN convolution layer, it is possible to process multiple input layers simultaneously. This would increase the external memory bandwidth required for loading layers. To mitigate the increase, all data except for the coefficients is stored in local M20K memory on the FPGA device. The amount of on-chip memory on the device limits the number of CNN layers that can be implemented.

Figure 6 : OpenCL Global Memory Bandwidth (AlexNet)

Most CNN features will fit within a single M20K memory, and with thousands of M20Ks embedded in the FPGA fabric, the total memory bandwidth available for processing convolution features in parallel is of the order of tens of terabytes per second.

Figure 7 : Arria 10 GX1150 / Stratix 10 GX2800 resources

Depending upon the amount of M20K resource available, it is not always possible to fit a complete network on a single FPGA. In this situation, multiple FPGAs can be connected in series using high-speed serial interconnects. This allows the network pipeline to be extended until sufficient resource is available.
A key advantage of this approach is that it does not rely on batching to maximize performance; the latency is therefore very low, which is important for latency-critical applications.

Figure 8 : Extending a CNN Network Over Multiple FPGAs

Balancing the time taken between layers to be the same requires adjusting the number of parallel input layers implemented and the number of pixels processed in parallel.

Figure 9: Resources for 5×5 convolution layer of Alexnet

Figure 9 lists the resources required for the 5×5 convolution layer of AlexNet with 48 parallel kernels, for both a single-precision and a 16-bit fixed-point version on an Intel Arria 10 FPGA. The numbers include the OpenCL board logic, but illustrate the benefit that lower precision has on resource usage.

Fully Connected Layer
Processing of a fully connected layer requires a unique coefficient for each element and therefore quickly becomes memory bound with increasing parallelism. The amount of parallelism required to keep pace with the convolutional layers would quickly saturate the FPGA’s off-chip memory; it is therefore proposed that this stage either batches its input layers or prunes them.

As the number of elements in an inner product layer is small, the amount of storage required for batching is small compared to the storage required for the convolution layers. Batching layers then allows the same coefficient to be used for each batched layer, reducing the external memory bandwidth.

Pruning works by studying the input data and ignoring values below a threshold. As fully connected layers are placed at the later stages of a CNN network, many possible features have already been eliminated. Therefore, pruning can significantly reduce the amount of work required.
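As a hedged illustration of the pruning idea (the threshold scheme below is an assumption; the whitepaper does not specify one), a fully connected layer can skip the multiply-accumulate work for inputs whose magnitude falls below a threshold:

```c
#include <stddef.h>
#include <math.h>

/* Fully connected (inner product) layer with simple input pruning:
 * inputs whose magnitude is below 'threshold' are ignored, which skips
 * the coefficient fetch and the multiply-accumulate for that element. */
static void fc_pruned(const float *in, size_t n_in,
                      const float *weights,      /* n_out x n_in, row major */
                      float *out, size_t n_out,
                      float threshold)
{
    for (size_t o = 0; o < n_out; ++o)
        out[o] = 0.0f;

    for (size_t i = 0; i < n_in; ++i) {
        if (fabsf(in[i]) < threshold)
            continue;                      /* pruned: no work, no bandwidth */
        for (size_t o = 0; o < n_out; ++o)
            out[o] += weights[o * n_in + i] * in[i];
    }
}
```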

Resource
The key resource driver of the network is the amount of on-chip M20K memory available to store the outputs of each layer. This is constant and independent of the amount of parallelism achieved. Extending the network over multiple FPGAs increases the total amount of M20K memory available and therefore the depth of the CNN that can be processed.

Conclusion
The unique flexibility of the FPGA fabric allows the logic precision to be adjusted to the minimum that a particular network design requires. By limiting the bit precision of the CNN calculation the number of images that can be processed per second can be significantly increased, improving performance and reducing power.

The non-batching approach of the FPGA implementation allows single-frame latency for object recognition, ideal for situations where low latency is crucial, e.g. object avoidance.

Using this approach for AlexNet (single precision for layer 1, then using 16 bit fixed for remaining layers), each image can be processed in ~1.2 milliseconds with a single Arria 10 FPGA, or 0.58 milliseconds with two FPGAs in series.


Low Latency Key-Value Store / High Performance Data Center Services

Gateware Defined Networking® (GDN) – Search Low Latency Key-Value Store

Low Latency Key-Value Store (KVS) is an essential service for multiple applications. Telecom directories, Internet Protocol forwarding tables, and de-duplicating storage systems, for example, all need key-value tables to associate data with unique identifiers. In datacenters, high performance KVS tables allow hundreds or thousands of machines to easily share data by simply associating values with keys and allowing client machines to read and write those keys and values over standard high-speed Ethernet.

Algo-Logic’s KVS leverages Gateware Defined Networking® (GDN) on Field Programmable Gate Arrays (FPGAs) to perform lookups with the lowest latency (less than 1 microsecond), the highest throughput, and the least processing energy. Deploying GDN solutions saves network operators time, cost, and power, resulting in significantly lower Total Cost of Ownership (TCO).

Implementing Ultra Low Latency Data Center Services with Programmable Logic

Data centers require many low-level network services to implement high-level applications. Key-Value Store (KVS) is a critical service that associates values with keys and allows machines to share these associations over a network. Most existing KVS systems run in software and scale out by running parallel processes on multiple microprocessor cores to increase throughput.

In this paper, we take an alternate approach by implementing an ultra-low-latency KVS in Field Programmable Gate Array (FPGA) logic. As with a software-based KVS, lookup transactions are sent over Ethernet to the machine that stores the value associated with that key. We find that the implementation in logic, however, scales up to provide much higher search throughput with much lower latency and power consumption than other implementations in software.

Figure: Low Latency KVS diagram
Figure: Low Latency KVS packet handler

40Gbit AES Encryption Using OpenCL and FPGAs

Introduction
Altera’s launch of OpenCL support for FPGA systems has ushered in a new era in high performance computing using CPUs and FPGAs in a hybrid computing model. Altera’s OpenCL Compiler (ACL) support for FPGA cards:

• Gives programmers easy access to the power of FPGA computing.
• Offers significantly higher performance at much lower power than is available using other technologies.
• Provides a significant time-to-market advantage compared to traditional FPGA development using a hardware description language.
• Automatically abstracts the details of hardware design for designers.

Implementing FPGA designs with the OpenCL compiler allows a designer to easily offload parts of their algorithm to the FPGA to increase performance, lower power and improve productivity.

This parallel programming methodology uses a kernel approach where data is passed to the specified kernel for processing. The kernel code uses the C language with a minimal set of extensions that allows parts of the application code or subroutines to take advantage of parallel performance by processing on the FPGA.

This application note illustrates how to perform AES encryption on FPGAs using the OpenCL tool flow.

Advanced Encryption Standard (AES)
The Advanced Encryption Standard (AES) is a symmetric-key encryption standard that has been adopted by the U.S. government. AES supports key sizes of 128, 192 and 256 bits, all of which operate on data in 128-bit blocks. The AES algorithm consists of multiple bit shifts and Exclusive Or (XOR) operations that make it an ideal candidate for acceleration on FPGAs.

AES Operations
AES operates on a 4×4 array of bytes, termed the state (different versions of AES with a larger block size have additional columns in the state). AES consists of four distinct processing stages, as listed below:
1. Key Expansion – The round keys are derived from the cipher key using Rijndael’s key schedule.
2. Initial Round :
a. Add Round Key: Each byte of the state is combined with the round key using a bitwise XOR.
3. Rounds
a. Sub Bytes: A non-linear substitution step where each byte is replaced with another using a lookup table.
b. Shift Rows: A transposition step where each row of the state is shifted cyclically a certain number of steps.
c. Mix Columns: A mixing operation which operates on the columns of the state, combining the four bytes in each column.
d. Add Round Key
4. Final Round
a. Sub Bytes
b. Shift Rows
c. Add Round Key

In this implementation the host processor performs the key expansion and the results are passed to the AES encryption algorithm on the FPGA. The key schedule changes infrequently, only when the session key changes, so there is no significant performance impact with this approach.

ECB and CTR Ciphers
Electronic Codebook (ECB) is the simplest cipher mode to program on an FPGA. It is easily replicated multiple times and can be pipelined as the output has no effect on the next result. ECB can be seen in Figure 1.

Figure 1. Electronic Codebook (ECB) mode encryption

The downside of ECB is that identical plaintext blocks are encrypted in the same way, which does not hide patterns in the data. A better approach is to use a Counter (CTR) mode, as seen in Figure 2. Here a counter is encrypted, incremented and XORed with the plaintext to create the output ciphertext.

Figure 2. Counter (CTR) mode encryption

The increment of the counter is arbitrary, with the most common being a simple count. An added bonus of the CTR method is that the encryption and decryption logic are identical (Figure 3). The counter is simply reset before decrypting.

Figure 3. Counter (CTR) mode decryption

CTR encryption also allows consecutive blocks of data to be encrypted in parallel by including a stride pattern in the counter.
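A hedged C sketch of CTR mode with a stride follows. It is illustrative only: aes_encrypt_block is a placeholder for a real AES block cipher (in the whitepaper's design, the FPGA's round pipeline), and the counter layout is an assumption. Each parallel lane k encrypts counters k, k+stride, k+2*stride, ..., so consecutive blocks are processed independently.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Placeholder: encrypt one 16-byte block with the expanded key.
 * Any standard AES implementation could stand in here. */
void aes_encrypt_block(const uint8_t in[16], uint8_t out[16],
                       const uint8_t *round_keys);

/* CTR mode with a stride: lane 'lane' of 'stride' parallel lanes
 * processes blocks lane, lane+stride, lane+2*stride, ...
 * Each keystream block is the encrypted counter XORed with plaintext. */
static void aes_ctr_lane(const uint8_t *plain, uint8_t *cipher,
                         size_t n_blocks, const uint8_t *round_keys,
                         uint64_t nonce, unsigned lane, unsigned stride)
{
    for (size_t b = lane; b < n_blocks; b += stride) {
        uint8_t ctr[16] = {0}, ks[16];
        uint64_t count = (uint64_t)b;

        memcpy(ctr, &nonce, 8);            /* nonce in the upper half (illustrative layout) */
        memcpy(ctr + 8, &count, 8);        /* block counter in the lower half               */

        aes_encrypt_block(ctr, ks, round_keys);
        for (int i = 0; i < 16; ++i)       /* XOR keystream with plaintext */
            cipher[16 * b + i] = plain[16 * b + i] ^ ks[i];
    }
}
```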

OpenCL for FPGAs
The FPGA is effectively a ‘blank canvas’ on which a user can design an architecture fit for purpose.

When an OpenCL kernel is compiled using the ACL compiler, a processing architecture is designed around the needs of the algorithm. This includes integer and floating point logic built to the required depth and accuracy. The memory architecture is designed to meet the needs of the algorithm by utilizing the many hundreds of individually accessible memories available on the FPGA fabric.

Altera’s OpenCL compiler compiles OpenCL kernels, which in turn are processed by Altera’s Quartus tools to create an SRAM Object File (SOF). This SOF file can then be downloaded onto the FPGA. The OpenCL API on the host CPU provides the OpenCL commands for controlling the compiled kernels.

Describing AES encryption using OpenCL
The following pseudo code in Figure 4 describes the processing stages required for AES encryption. The nature of the AES encryption algorithm allows the entire code to be unrolled into a single, very deep pipeline containing thousands of integer operations. The ACL compiler has #pragma directives that can be added to a user’s OpenCL code to instruct nested loops to be unrolled, allowing the full AES code to be flattened. Only a small number of changes to the original OpenCL source code are required.

Targeting a GPU with this modified source code is still possible, as the #pragma directives are simply ignored. This allows the OpenCL code to be functionally verified quickly using a CPU or GPU prior to compilation to the required SOF file.

Figure 4. AES encryption pseudo code
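Since the pseudo code figure is not reproduced here, the sketch below shows, in OpenCL C, how the AES-128 rounds might be written so that the compiler's #pragma unroll directive flattens them into one deep pipeline (and is simply ignored on a GPU). The round helper functions are hypothetical placeholders, not taken from the whitepaper.

```c
/* OpenCL C sketch.  The helpers below are placeholders for the
 * byte-level AES round operations. */
uchar16 counter_block(size_t block_index);
uchar16 add_round_key(uchar16 s, __constant uchar *rk, int round);
uchar16 sub_bytes(uchar16 s);
uchar16 shift_rows(uchar16 s);
uchar16 mix_columns(uchar16 s);

__kernel void aes128_ctr(__global const uchar16 *restrict plain,
                         __global uchar16 *restrict cipher,
                         __constant uchar *restrict round_keys)
{
    size_t gid = get_global_id(0);

    uchar16 state = counter_block(gid);           /* CTR: encrypt the counter */
    state = add_round_key(state, round_keys, 0);  /* initial round            */

    #pragma unroll                                /* flatten all rounds into  */
    for (int r = 1; r < 10; ++r) {                /* one deep pipeline        */
        state = sub_bytes(state);
        state = shift_rows(state);
        state = mix_columns(state);
        state = add_round_key(state, round_keys, r);
    }

    state = sub_bytes(state);                     /* final round: no MixColumns */
    state = shift_rows(state);
    state = add_round_key(state, round_keys, 10);

    cipher[gid] = plain[gid] ^ state;             /* XOR keystream with plaintext */
}
```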

Once the SOF file is built, the OpenCL kernel is downloaded onto the FPGA using the Altera Quartus tools. Just-in-time compilation of the kernel is not permitted due to the FPGA compile times; therefore kernels are loaded using the clCreateProgramWithBinary method. Altera also provides an OpenCL API library for the host to support communication with the FPGA accelerator card.

Target Technology

Figure 5. Nallatech PCIe-385N accelerator card (top-down view)

Nallatech’s PCIe-385N accelerator card is supported by the ACL compiler and host OpenCL API. Figure 5 shows a top-down view of the PCIe-385N. To target the Nallatech card, the OpenCL kernel is compiled using the relevant compiler switch for the 385.

The PCIe-385N features an 8-lane PCI Express Gen 3 capable interface for high speed host communications. The card also has 2 independent banks of DDR3 memory, totaling 16GBytes, coupled to the Stratix V.

AES Performance
The OpenCL FPGA programming methodology allows a programmer to pick the number of “work-groups” that best fit the desired performance, whether that be as much as possible or tailored for a particular throughput. In this case the goal was to encrypt 40 Gbit Ethernet data, which equates to a throughput of 5 GBytes per second. Note that although the PCIe-385N has a 10Gbit connection, the data is generated internally for testing purposes.

Figure 6. ACL Output for single kernel

Compilation of the AES algorithm targeting a single work-group yields a predicted throughput of 240 million work items per second. Each work item is a 16-Byte word, giving a throughput of 3.8 GBytes/second. This would be insufficient to encrypt 40 Gbit data. Figure 6 shows the ACL output for a single AES kernel on the FPGA. Fortunately, FPGAs are particularly efficient at integer arithmetic, allowing more than one work-group to fit within the FPGA. Targeting multiple work-groups is done using a simple kernel attribute, “__attribute((num_copies(n)))”. To achieve the desired 40 Gbit data rate only 2 copies of the AES kernel were required. Figure 7 shows the ACL output for the two AES kernels.

Figure 7. ACL Output for two AES kernels

Device Utilisation
How much of the FPGA an algorithm uses is an important aspect of FPGA programming. An algorithm only utilizes the logic it requires within the FPGA, leaving the remaining logic unused and consuming minimal power. Therefore, large power savings can be achieved by designing a kernel to meet only the needs of the target system, something that is not possible on CPU and GPU technologies.

There is no point designing an FPGA OpenCL kernel to run 100x faster, using, say, 100% of the device, when Amdahl’s law suggests only a maximum system increase of 10x is possible.

Performance
To measure the performance improvement, the same OpenCL source code was compiled and run on an AMD Radeon HD 7970 GPU card. This device has 2048 stream processors and an engine clock speed of 925 MHz. The FPGA design has 2 dedicated AES streams and a clock speed of only 170 MHz. The complexity of the AES encryption and the interdependency of the data result in a modest peak performance of ~0.33 GBytes/sec throughput on this GPU. The FPGA AES streams are able to encrypt a full 16-Byte block every clock cycle to achieve 5.2 GBytes/sec throughput. All performance figures reflect the kernel processing time only.

The power consumption of the FPGA accelerator is also significantly lower, requiring approximately 25 Watts compared to several hundred Watts on the GPU.

The AES source code was also compiled onto a 2GHz Intel Xeon E5503 processor, achieving a performance of ~0.01 GBytes/sec per thread. The low throughput reflects the thousands of operators required for each 16-Byte output of the AES calculation and the limited parallel processing available to the CPU.

Figure 8. Performance results for various technologies

Conclusion
To achieve 40 Gbits/second throughput for the AES encryption described here, only 42% of the Stratix V A7 FPGA device was utilized. The remainder could be left unused for power savings, or extra kernels could be placed in parallel with the encryption core.

Altera’s ACL compiler allows easy access to FPGA accelerator technology. For the first time, utilizing OpenCL, code is truly portable between CPU, GPU and FPGA technologies. AES encryption is an example of a class of algorithm that can now be tackled using the OpenCL language, and which could not previously be performed efficiently on traditional compute platforms.


FPGA Acceleration of 3D Component Matching using OpenCL

Introduction
2D component matching, also known as blob extraction or region extraction, is commonly used in computer vision for detecting connected regions that meet pre-determined criteria, such as a threshold value. The technique can also be extended to volumes. Use cases include medical imaging volume analysis (e.g. MRI results), core porosity analysis (e.g. oil & gas) and many other connectivity analysis problems.

Techniques
A technique for 2D component labeling is presented here, with a follow-on section describing how this can be extended to 3D volumes. This paper shows how it is possible to dramatically accelerate 3D component matching on an energy-efficient FPGA-based platform using OpenCL – the open standard for parallel programming. For 2D component matching several algorithms are commonly cited; the following are two examples.

One component at a time
The 2D image is scanned until a pixel meets the required criteria. The pixel’s neighbors are then analyzed and a linked list is created of the connected neighbors. This process is repeated recursively until no more connected neighbors are found. All pixels that were part of a connected linked list are assigned the same index. The index is then incremented and the next unconnected point on the image is analyzed. The process continues until the entire image is scanned. This technique can easily be adapted for three dimensions, i.e. 3D component matching.

The random traversal through memory required for this approach places the performance bottleneck on system memory bandwidth.

Two pass
For the two-pass algorithm the image is scanned linearly from the top left corner to the bottom right corner. A component is given an ID according to the minimum value of its labelled neighbors. If no labelled neighbor exists, the ID value is incremented and the pixel is set to this new value.

Figure 1 : Component ID labelling

Figure 1 illustrates the surrounding pixels required to obtain the new ID for the current pixel. If pixel A or D is non-zero, differs from C, and the current pixel is valid, we have a condition where two IDs clash. At this point the lowest ID is assigned to the maximum ID and a note of the swap is made in a lookup table. This lookup table is used after the image is scanned to replace IDs that have been swapped for other IDs in the image.

Figure 2: Components merging

After the image has been scanned, the lookup table is used to replace merged IDs to create the final connected component image. This is illustrated in Figure 2. The second stage is not necessary if the data is stored as both the pre-merged image and the component IDs, which avoids a costly rescan of the image. When the results are used later on, the pre-merged image is simply passed through the lookup table to produce the new image. Whether this is done or not depends upon the number of IDs discovered and the amount of resource required to store them, which in turn depends upon the size and complexity of the image.

The two-pass algorithm is most suitable for implementation on an FPGA, as the data access is linear and therefore suited to a pipelined design.
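A minimal C sketch of the two-pass scheme described above follows. The neighbor layout and merge policy are simplified relative to Figure 1 (only the left and upper neighbors are examined), and it assumes fewer than MAX_IDS components: the first pass assigns provisional IDs and records clashes in an equivalence lookup table; the final labels are obtained by passing the provisional labels through the table, exactly as described above.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_IDS 65536

static uint32_t merge_lut[MAX_IDS];    /* provisional ID -> merged ID */

static uint32_t resolve(uint32_t id)   /* follow merges to the root ID */
{
    while (merge_lut[id] != id)
        id = merge_lut[id];
    return id;
}

/* First pass over a binary image: label pixels above 'threshold' using the
 * already-visited left and upper neighbours; clashes are recorded in the
 * lookup table rather than rescanning the image.  The second stage is then
 * just labels[i] = resolve(labels[i]) (or a lookup when results are used). */
static uint32_t label_two_pass(const uint8_t *img, uint32_t *labels,
                               size_t w, size_t h, uint8_t threshold)
{
    uint32_t next_id = 1;
    merge_lut[0] = 0;

    for (size_t y = 0; y < h; ++y)
        for (size_t x = 0; x < w; ++x) {
            size_t i = y * w + x;
            if (img[i] < threshold) { labels[i] = 0; continue; }

            uint32_t left = (x > 0) ? resolve(labels[i - 1]) : 0;
            uint32_t up   = (y > 0) ? resolve(labels[i - w]) : 0;

            if (!left && !up) {                    /* new component        */
                labels[i] = merge_lut[next_id] = next_id;
                ++next_id;
            } else if (left && up && left != up) { /* two IDs clash: merge */
                uint32_t lo = left < up ? left : up;
                uint32_t hi = left < up ? up : left;
                merge_lut[hi] = lo;
                labels[i] = lo;
            } else {
                labels[i] = left ? left : up;
            }
        }
    return next_id;   /* one past the highest provisional ID issued */
}
```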

3D component matching labelling
3D component matching labelling is typically performed using the “one component at a time” approach. However, for large images this can quickly move the volume outside of the CPU cache, causing cache thrashing and significantly reducing overall performance. For an FPGA approach we would also not be able to hold the volume data in local memory, limiting performance to the global memory bandwidth of the accelerator. To avoid these issues the two-pass approach is applied to create a series of 2D component-matched planes. The 2D planes are then combined using a similar approach to the 2D matching illustrated in Figure 2, except that the previous plane in the z axis is also considered.

The number of individual IDs required for a large volume would exceed the storage capacity of local FPGA memory. Therefore, a technique is applied where only the current and previous plane IDs are stored. An ID that does not appear in, and has not been linked with, the current plane can be considered finished and will not occur again. At this point the ID is placed in global memory. This limits the number of IDs to store in local memory to twice the maximum number of IDs expected for any one plane.

Linking between planes is illustrated in Figure 3.

Figure 3 : Using overlapping planes to connect components in 3D component matching

It is often desirable to store statistics regarding the connected component data, such as the number of occurrences of an ID. In a similar fashion to the IDs, the current statistics of an ID can be stored in local memory. Only when an ID no longer exists are its statistics committed to global memory.

An added bonus of this approach is the ability to reuse planes for different volume analyses. If the purpose of the 3D connected component labelling is to produce spatial statistics on a volume, any overlapping volumes can reuse the 2D plane data without the need to recalculate. This can save significant amounts of processing time, depending upon the amount of overlap that occurs.

FPGA implementation
The algorithm is relatively simple and contains no complex logic. Therefore it requires only small amounts of compute resource relative to what is available in modern FPGA devices. Thanks to the techniques applied here, the algorithm also requires only a small percentage of the global memory bandwidth typically available on FPGA accelerator boards. Therefore it is possible to implement many parallel instantiations, either working on different thresholds or volumes, until either resource or global memory bandwidth is exhausted.

The 3D volume cannot be subdivided for parallelisation, as the previous plane calculation is required prior to calculating the next. However, it is often desirable to process many different threshold values of a 3D volume in order to analyze boundaries between different materials, etc. Therefore, for this white paper it is assumed that multiple threshold values will be processed in parallel.

OpenCL Implementation
The AOC compiler provided by Altera allows users to target FPGA accelerators using the Khronos OpenCL standard. This section describes how to implement the connected 3D component matching technique using OpenCL, targeting FPGA devices.

Figure 4 : Nallatech 385 FPGA card

The FPGA device targeted here was a Stratix V A7 device. This device is a mid-range FPGA of the Stratix V series and provides a good balance of on-chip memory and logic gates.

In order to achieve good acceleration it was necessary to replicate the algorithm as many times as possible. For the implementation described here, two processing algorithms were implemented.

1. A kernel that creates the 2D connected component planes for the 3D volume, at various different threshold values.
2. A kernel implementing the connected 3D component matching algorithm, which merges planes together to create a connected volume, recording volume statistics as desired.

This could be implemented as two different processing kernels, or as two distinct programs with the FPGA reprogrammed between stages 1 and 2. The latter is more desirable if re-use of plane data is expected and is the approach described here.

Kernel 1 : Creating the Planes
The OpenCL compiler allows two distinct programming approaches. The first is the traditional SIMD approach using an NDRange kernel; the second is a single work-item flow. The latter is the approach recommended by Altera if a design has loop or memory dependencies. A single work-item flow pipelines loops within the kernel, executing a new index every clock cycle if possible. This allows a technique referred to as a “sliding window” to be utilised, massively reducing the impact on global memory bandwidth. The sliding window allows previously calculated rows to be stored in local memory, removing the need to constantly refer to off-chip global memory.

Figure 5 : Sliding window

With the sliding window implemented there is only one read and one write to global memory per pixel.

After the plane has been processed the new plane ID lookup table is stored in global memory ready for the second phase (Linking of planes).

It is possible to create multiple kernels on the FPGA accelerator, one for each threshold to be processed. However, as the FPGA is a blank canvas, every access to global memory must create its own memory controller circuitry. With tens of kernels implemented, the memory control logic would occupy a large amount of the on-device resource. To avoid replication of the memory circuitry we can create 2 kernels dedicated to handling global memory accesses, one for reading input data and another for writing output data, i.e. a producer and a consumer kernel. These kernels then fan data out to, and consume results from, multiple processing kernels. This prevents the unnecessary replication of global memory logic and allows more parallel paths to be implemented.

Figure 6 : Multiple kernels connected via channels

Figure 6 shows the arrangement of consumer, producer and worker kernels used to implement multiple paths. The communication between kernels is done via channels. Each worker kernel receives data from its own channel and writes results back to its own output channel. Each worker kernel is therefore identical with the exception of the channel IDs.
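A hedged OpenCL C sketch of the producer/worker/consumer arrangement in Figure 6, using the channels extension, is shown below. The cl_intel_channels naming used here is from the newer Intel toolchains; the Altera-era toolchain used an equivalent _altera-suffixed variant. The worker body, channel depths and data types are illustrative, not the whitepaper's kernels.

```c
#pragma OPENCL EXTENSION cl_intel_channels : enable

channel uint in_ch0  __attribute__((depth(64)));
channel uint out_ch0 __attribute__((depth(64)));
/* ...one pair of channels per worker kernel... */

/* Producer: the only kernel that reads input data from global memory. */
__kernel void producer(__global const uint *restrict src, uint n)
{
    for (uint i = 0; i < n; ++i)
        write_channel_intel(in_ch0, src[i]);   /* fan out to the workers */
}

/* Worker: identical for every threshold except for its channel IDs. */
__kernel void worker0(uint n, uint threshold)
{
    for (uint i = 0; i < n; ++i) {
        uint px = read_channel_intel(in_ch0);
        uint label = (px >= threshold);        /* placeholder for the real
                                                  plane-labelling logic    */
        write_channel_intel(out_ch0, label);
    }
}

/* Consumer: the only kernel that writes results back to global memory. */
__kernel void consumer(__global uint *restrict dst, uint n)
{
    for (uint i = 0; i < n; ++i)
        dst[i] = read_channel_intel(out_ch0);
}
```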

Kernel 2 : Linking the planes
Once each plane has been created it is necessary to link the planes in order to create the 3D connected component volume. Again a sliding window is used to reduce the number of global memory accesses. In this case two inputs and one output are required, as we need the input data of both the current and previous planes.

Figure 7 : Back plane sliding window

Figure 8 : Linking IDs between planes

Figure 8 illustrates the 9 pixels from the previous plane that are possible connections with the front pixel. As we scan along the current row, any paths along the three back rows are tracked (see Figure 9). If no paths via the back or current plane are possible, a path is said to no longer exist and the current path ends. Once a row is complete, the IDs are then modified to equal the minimum ID found on the valid paths.

Figure 9: Link front plane to back plane

As IDs can change from one row to the next, an ID conversion must be applied to each new back-plane pixel read from the sliding window. This would ordinarily require 9 reads from the ID lookup table; however, we can use the locality of the data to recognise that adjacent pixels of the back plane must be equivalent. We can therefore reduce the 9 possible IDs to just 4. This is convenient, as the Altera FPGA devices permit 4 simultaneous accesses to local memory.

After the plane is completed the statistics of any IDs that no longer exist are stored in global memory to be retrieved by the host.

Multiple Binaries
An individual aocx (device binary) file was generated for the plane creation and for linking the planes. The host programs the device with the first binary and executes it. The results produced by the first binary are placed in global memory. The device is then reprogrammed with the second binary, which executes, reading the previous binary’s results from global memory. The final results are then retrieved by the host.

Benchmark
The following benchmark targeted Nallatech’s p385n_hpc_a7 accelerator board. This allowed up to 8 parallel worker kernels to be instantiated in a single FPGA device. This was then compared to a single core of a Xeon E5-2430 2GHz device with a cache size of 15360 KBytes. The Xeon implemented a “one component at a time” technique optimized for a CPU.

The performance improvement of the FPGA varies depending upon the complexity of the volume of data being analyzed. The more complex the image, the better the FPGA performs compared to the CPU. When the image is sparse, the FPGA has only a few IDs to report, but the CPU also does not traverse far through the volume when processing its data, so the acceleration is lower. When the data is dense, the CPU must traverse all points in a nonlinear fashion, whereas the FPGA traverses the data linearly, with a significant performance improvement.

To quantify the acceleration it is necessary to plot performance increase against the density of the image. Figure 10 shows the time taken to process 8 threshold values for varying density of the volume.

Figure 10 : Processing time versus density of valid data (%)
(256x256x256 data points, 8 parallel thresholds)

Figure 11 : Acceleration versus density of valid data (%)

As can be seen from Figure 10, the performance of the Xeon tails off quickly for volumes with a high percentage of valid data points. This is due to the linked list used to track the current position growing in complexity. The FPGA version does not require the storage of a linked list and is therefore unaffected by how densely packed the volume is. However, the FPGA performance is affected by the number of unique path IDs. Any IDs that must be merged will have their IDs swapped after the end of each row. The likelihood of this occurring increases with the number of IDs and therefore increases the time spent in the ID swapping logic. For a very dense volume the number of unique IDs reduces, until there is just 1 ID for a nearly full volume. At this point the FPGA has to perform no ID swapping and the FPGA implementation is at its most efficient.

Figure 12 : Processing time (seconds) versus percentage of
volume occupied. (256x256x256 data points, 8 parallel thresholds)

Conclusion
Using OpenCL and FPGAs it is possible to significantly accelerate connected 3D component matching. With the memory-efficient algorithm described here, it is possible to replicate the processing kernel multiple times. This technique should extend to future, larger FPGAs with more resource; next-generation FPGAs should therefore yield significantly greater performance than is demonstrated here.

The plane implementation will also scale to larger volumes. The only limitation will be the number of IDs required for each plane. The number of potential unique IDs increases linearly with the area of the plane, and these have to be stored in local memory on the FPGA. However, there is no limit to the number of planes or the depth of the volume, global memory depth permitting.

Future Roadmap
The next generation of Altera FPGAs will provide an order of magnitude improvement over the results presented here. With the introduction of Stratix 10, Altera devices will utilize 14nm Tri-Gate transistor technology. The resulting higher clock speeds and denser devices will produce a step change in overall performance. Applying these performance gains to the 385 results presented earlier indicates a greater than 10x performance improvement versus Stratix V.

Figure 13 : Acceleration Versus a single Xeon Core


FPGA Acceleration of Lattice Boltzmann using OpenCL

Introduction
The Lattice Boltzmann Method (LBM) is a technique for simulating the movement of complex fluid systems. Fluid systems are used in many industries to transmit signals and power using a network of tanks, pipes, valves, pumps and other flow devices. Examples of applications include industrial processing, vehicular control and medical appliances. It is important that companies using fluid systems in this way have a systematic method of mathematically modelling different types of fluid systems for safe and reliable operation. This can be achieved, but typically at significant computational cost. Computational Fluid Dynamics (CFD) is one of the most demanding branches of high-performance computing (HPC) in terms of resources, and there is constant demand for cheaper, faster CFD computing platforms. This paper describes how it is possible to dramatically accelerate the LBM technique on an energy-efficient FPGA-based platform using OpenCL – the open standard for parallel programming.

Techniques
Traditional CFD methods solve the conservation equations for mass, energy, etc., whereas the LBM model uses particles to propagate these quantities. To simulate every particle in a system would be impossible, hence the LBM technique uses particle densities confined to a discrete lattice to simulate particle interactions.

The LBM technique is split into two stages: Collision and Streaming. The collision stage looks to balance the particle distributions. There are various techniques for finding an equilibrium, some more accurate than others. The operator used here is the Bhatnagar-Gross-Krook (BGK) operator.

Figure 1 : D2Q9 Lattice

Lattice
Different lattice topologies are possible for different dimensions and algorithm approaches. A popular way to classify lattices is the DnQm scheme, where n stands for the number of dimensions and m for the number of lattice velocity distributions.

The lattice used in this white paper is a D2Q9 lattice illustrated in Figure 1.

LBM Maths
The following equation is the BGK operator applied to the 9 lattice distributions contributing to the current lattice point:

Equation 1 : D2Q9 equilibrium distribution function
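The equation image is not reproduced here. For reference, Equation 1 corresponds to the standard BGK relaxation with the D2Q9 equilibrium distribution, written here in lattice units (this is the textbook form rather than a transcription of the original image):

f_i \leftarrow f_i - \frac{1}{\tau}\left(f_i - f_i^{\mathrm{eq}}\right),
\qquad
f_i^{\mathrm{eq}} = w_i\,\rho\left[1 + 3\,(\mathbf{e}_i\cdot\mathbf{u})
  + \tfrac{9}{2}\,(\mathbf{e}_i\cdot\mathbf{u})^2
  - \tfrac{3}{2}\,(\mathbf{u}\cdot\mathbf{u})\right]

where the e_i are the nine lattice velocities, the weights are w_0 = 4/9, w_1..4 = 1/9 and w_5..8 = 1/36, the local density and momentum are rho = sum_i f_i and rho u = sum_i f_i e_i, and tau is the relaxation time.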

Once the new distributions are calculated, they must be distributed to neighboring lattice points. This is the Streaming stage of LBM.

Figure 2: Streaming stage

Particle distributions are swapped between lattice points along the 8 non-zero direction vectors.

Implementations
The lattice Boltzmann code is a memory-bound problem. For the D2Q9 lattice, 9 floating-point numbers must be read and updated for every lattice point during the collision phase. Here data is read in a linear fashion; however, the propagate stage must implement some out-of-order memory accesses to swap data between adjacent lattice points.
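A minimal C sketch of the per-lattice-point collision update follows (standard BGK/D2Q9 form, written for illustration rather than taken from the FPGA kernel): the nine distributions are read, the local density and velocity are computed, and each distribution is relaxed toward its equilibrium.

```c
/* D2Q9 lattice velocities and weights (standard ordering: rest, axes, diagonals). */
static const int   ex[9] = { 0, 1, 0,-1, 0, 1,-1,-1, 1 };
static const int   ey[9] = { 0, 0, 1, 0,-1, 1, 1,-1,-1 };
static const float w [9] = { 4.f/9,
                             1.f/9, 1.f/9, 1.f/9, 1.f/9,
                             1.f/36,1.f/36,1.f/36,1.f/36 };

/* BGK collision for a single lattice point: f[9] is read and updated in place. */
static void bgk_collide(float f[9], float tau)
{
    float rho = 0.f, ux = 0.f, uy = 0.f;
    for (int i = 0; i < 9; ++i) {          /* macroscopic density and momentum */
        rho += f[i];
        ux  += f[i] * ex[i];
        uy  += f[i] * ey[i];
    }
    ux /= rho;  uy /= rho;

    float usq = ux * ux + uy * uy;
    for (int i = 0; i < 9; ++i) {
        float eu  = ex[i] * ux + ey[i] * uy;
        float feq = w[i] * rho * (1.f + 3.f*eu + 4.5f*eu*eu - 1.5f*usq);
        f[i] -= (f[i] - feq) / tau;        /* relax toward equilibrium */
    }
}
```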

For a GPU implementation, it is the global memory access that ultimately limits the performance of the lattice Boltzmann code. FPGAs, however, offer an alternative approach that removes this memory bottleneck and provides almost unlimited scalability.

Lattice Boltzmann FPGA OpenCL
Typical OpenCL Lattice Boltzmann implementations work by creating hundreds of threads, all working in parallel, but ultimately limited by the global memory bandwidth available. The Altera OpenCL compiler offers an alternative OpenCL programming model that creates one or more pipelined kernels, where the parallelism comes from the depth of the pipeline: the more complex the pipeline, the more floating-point logic is executed in parallel.

FPGAs have significant local memory resources that can be configured in many different ways, from large single buffers to hundreds of small buffers. This flexibility allows the Altera OpenCL Compiler (AOC) to create memory topologies specifically designed for the algorithm that needs accelerating. The consequence of this is a significant reduction in the global memory bandwidth requirements of the algorithm.

The collision stage of the LB algorithm accesses global memory in a linear fashion and needs no optimizations. However, the streaming stage requires data from the neighboring lattice points. The delivery of this data can be optimized by using a cached copy of the output in what is referred to as a sliding window. A sliding window approach allows data to be read linearly and buffered in local memory, from which it can be read as often as required.

Figure 3 : Sliding window

The sliding window allows previously calculated rows to be stored in local memory, allowing the streaming stage to be combined with the collision stage. The entire algorithm can then be pipelined to generate a result every clock cycle. What’s more, multiple pipelines can be cascaded together, with global memory access only required for the input to the first stage and the output from the final stage. The number of lattice points calculated per second therefore increases linearly with each pipeline stage, with no increase in global memory requirements.

Table 1

Table 1 lists the performance of the BGK D2Q9 algorithm for various technologies.

There are 106 floating-point calculations required per lattice update (LUT). This makes the sustained floating-point performance equivalent to 106 multiplied by the number of lattice updates per second (LUTs/sec).

Figure 4: Multiple time steps implemented in a pipeline

Performance

The pipelining allows the performance to be improved linearly with each new pipeline stage, until the resources of the FPGA are exhausted. Four such pipelines fit into a pcie385n_d5 part.

Figure 5 : PCie385n_d5 and Server

Figure 6 : Performance, MLUTs/Sec

Power
When measuring HPC performance, it is important to consider the power footprint of different technologies. Figure 7 shows the performance per watt for the three technologies available to study.

Figure 7 : Performance, MLUTs/Sec/Watt

Results
The following images show the output of the FPGA implementation. Yellow depicts the areas of fastest flow, whilst the black areas are slowest.

Figure 8 : Flow through a slit

Figure 9 : Turbulent flow around a sphere

Figure 10 : Flow through a porous object

D3Q19 3D Lattice
The implementation described here can also be applied to a 3D lattice. In this case, the sliding window stores planes rather than lines of lattice data, which requires more internal memory and limits the plane size. Therefore, the cross-section of the volume that can be calculated using this approach is limited. The depth of the lattice is, however, unlimited.

Conclusion
Using the OpenCL tool flow, it was possible to achieve significant acceleration of a well-known HPC problem using FPGA technology in only a few days of coding. By abstracting the complexities of FPGA interfaces and hardware description languages, OpenCL massively increases productivity without significantly sacrificing design performance. This allows developers to quickly verify the suitability of FPGA acceleration without committing to months/years of design effort. To learn more about the advantages of FPGA-based acceleration, please visit nallatech.com
