FPGA-Accelerated NVMe Storage Solutions



In recent years, the migration towards NAND flash-based storage and the introduction of Non-Volatile Memory Express® (NVMe™) have multiplied the opportunities for technology companies to “do storage” differently [1]. The rapid growth and diversity of real-time digital businesses have demanded this innovation so that new products and services can be realized. New storage products have therefore followed trends towards higher bandwidth, lower latency and a reduction in footprint and total cost of ownership – critical improvements for companies relying on large infrastructures. Recent market reports [2] forecast that the NVMe market will grow at approximately 15% CAGR to reach $57 billion by 2020. The NVMe market continues to evolve and seeks further technological innovations in three areas:

(1) storage virtualization to increase flexibility and security
(2) localized data processing close to the stored data
(3) disaggregated storage for optimized infrastructures [3]

In March 2018, Nallatech announced the 250 series of FPGA products, which provide innovative solutions catering to the needs of the storage market. The 250 series products feature Xilinx® UltraScale+™ FPGAs and MPSoCs, which offer ASIC-class functionality in a single chip and fit the technology needs of the storage industry [6]. By combining NVMe with reconfigurable FPGA and MPSoC logic, Nallatech is offering a new class of storage products with a critical differentiator in a fast-evolving market: the flexibility and reconfigurability of the Xilinx devices ensure that 250 series-based solutions can remain current as the NVMe standard incorporates new features over time [5].

This application note describes how Nallatech’s 250 series of FPGA and MPSoC-enabled accelerator products allows customers to construct high-performance, scalable NVMe storage for next-generation IoT and cloud infrastructures.

NVMe Technology & Characteristics

NVM Express is a high-throughput, low-latency storage technology which has increased the performance and flexibility of existing datacenter infrastructures [8]. By replacing current storage technologies like SAS and SATA with NVMe, IT architects have multiplied the available storage throughput of their datacenters [6]. The NVMe protocol sits on top of the PCIe transport layer and currently operates over PCIe 3.0; with new drives expected to become available in 2018, NVMe will soon support PCIe 4.0 rates [7]. Over PCIe 3.0, NVMe can transfer up to a theoretical maximum of 4GB/s in the 4-lane U.2 or M.2 form factors; for example, the Micron 9200 SSD U.2 drives deliver sustained sequential read/write performance of up to 3.5GB/s according to the manufacturer’s website.
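As a sanity check on the ~4GB/s figure, the theoretical link rate follows directly from the PCIe 3.0 signaling rate and its 128b/130b line encoding. The short Python sketch below is illustrative only (not from any Nallatech tooling):

```python
# Back-of-envelope check of the ~4 GB/s figure for a 4-lane NVMe link.
# PCIe 3.0 signals at 8 GT/s per lane with 128b/130b line encoding.
GT_PER_S = 8.0
ENCODING = 128 / 130          # usable payload fraction after line encoding

def pcie3_bandwidth_gbs(lanes: int) -> float:
    """Theoretical one-direction bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bits_per_s = lanes * GT_PER_S * 1e9 * ENCODING
    return bits_per_s / 8 / 1e9

print(f"x4 link: {pcie3_bandwidth_gbs(4):.2f} GB/s")   # ≈ 3.94 GB/s
print(f"x8 link: {pcie3_bandwidth_gbs(8):.2f} GB/s")   # ≈ 7.88 GB/s
```

The ~3.94GB/s result is the raw link ceiling; protocol overheads (TLP headers, flow control) reduce the achievable payload rate slightly, which is consistent with drives like the Micron 9200 sustaining around 3.5GB/s.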

Beyond the additional read/write throughput, NVMe supports new features that improve a storage array over SAS or SATA equivalents. The NVMe protocol better fits the needs of enterprise environments by supporting up to 64K I/O queues. Previous technologies designed for spinning disks showed limitations due to far fewer queues; with NVMe, each of the host’s processes can have its own queue and manage it independently. Additionally, the NVMe specification provides a well-defined arbitration mechanism to assign priorities to each of the queues, as well as support for MSI/MSI-X and interrupt aggregation. This finer-grained tuning of read/write operations to the NVMe storage increases overall system performance.
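The weighted-arbitration idea can be illustrated with a toy model. The sketch below is a simplification for intuition only: the queue names and weights are invented, and the real specification defines its own priority classes and configurable burst sizes rather than this exact scheme.

```python
from collections import deque

# Illustrative weighted round-robin over per-process submission queues.
# Queue names and weights are invented examples, not NVMe-defined classes.
queues = {
    "db_writer":  (deque(["w1", "w2", "w3"]), 3),  # (pending commands, weight)
    "log_flush":  (deque(["f1", "f2"]), 2),
    "background": (deque(["b1", "b2", "b3"]), 1),
}

def arbitrate(queues):
    """Drain queues in weighted round-robin order, yielding commands."""
    while any(q for q, _ in queues.values()):
        for name, (q, weight) in queues.items():
            for _ in range(min(weight, len(q))):   # burst size = weight
                yield name, q.popleft()

for owner, cmd in arbitrate(queues):
    print(owner, cmd)
```

Higher-weight queues get proportionally more command slots per arbitration round, which is the essence of how NVMe lets a host prioritize latency-sensitive I/O over background work.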

The NVMe specification also supports more efficient random I/O operations and outperforms SAS or SATA SSD technologies by a factor of 10 in 4KB random read tests. This performance improvement for 4KB transfers better aligns with the requirements of enterprise software applications and operating systems. The support for multiple namespaces and the efficient support of I/O virtualization architectures like SR-IOV make NVMe the technology of choice in the datacenter. The high performance of NVMe technology facilitates the sharing of hardware resources between several Virtual Machines or users. For example, 25 IOPS is a relevant estimate of the requirements of a power user in a virtualized infrastructure. With 700,000 random 4K read IOPS, a single NVMe drive could fulfill the needs of 500 power users where 25 SAS HDDs would typically be needed [8].

Bringing FPGAs and MPSoCs into the mix adds flexibility to an already performant NVMe datacenter infrastructure. The NVMe specification defines a superset of commands that not all NVMe drives support; by using IP-based NVMe controllers, adjusting to new NVMe equipment only requires a firmware update of the reconfigurable logic instead of a hardware upgrade. In fact, over recent months, new innovative approaches to storage, like Open Channel, have caused a mini-revolution in the NVMe market, and large datacenter players are now pushing for a new NVMe hardware model. Microsoft recently introduced Project Denali, which specifies a methodology to disaggregate flash storage where the NVMe drives no longer handle management functions; new reconfigurable hardware does instead. With Project Denali and Open Channel, the functions traditionally performed in monolithic SSD drives are offloaded to the host or to an FPGA/MPSoC accelerator. Nallatech’s 250 product series fits this new model and provides an agile solution which will adapt to the new models as they are finalized.

NVMe Roadmap

Since the creation of NVMe in 2011, the NVMe consortium has remained very active. In fact, the NVMe protocol is currently evolving from three perspectives defined in separate specifications. In addition to the base NVMe specification, the NVMe Management Interface (NVMe-MI) specification details how to manage communications and devices (device discovery, monitoring, etc.), and the NVMe over Fabrics (NVMe-oF) specification defines how to communicate with non-volatile storage over a network, making the protocol transport agnostic [9].

Over time, as more users from various industries adopt NVMe, they characterize their needs and introduce new ideas for the specification. The adoption of the NVMe protocol is still growing, and it is generating innovation: hardware and software companies are finding new ways to get to the memory through the introduction of new form factors, the creation of new products and appliances, and more. The focus of the NVMe ecosystem is to give users the means to scale into datacenter or hyperscale infrastructures, and the protocol specification will continue to evolve in that direction [9].

2019 will see the release of revision 1.4 of the NVMe base specification, which will bring improvements in data latency, high-performance access to non-volatile data and ease of data sharing between several hosts. One of the features awaited by NVMe users, and cloud providers specifically, is IO determinism, which will increase the quality of service during parallel execution of IOs [10]. By limiting the impact of background maintenance tasks to a minimum and containing the influence of noisy neighbors, the IO determinism feature will give users consistent latency when accessing non-volatile data. An alternative approach is the previously discussed Open Channel architecture [11]. With this second method, the host takes over some of the management functions and only the data travels to the storage hardware; the drive’s physical interface to the host is limited to high-speed data lanes, with no sideband channels. This example shows the impact and relevance of any changes in the NVMe specification and highlights the requirement for a flexible NVMe hardware infrastructure.

As the new revisions of the base, MI and over Fabric specifications come out in the next few months, NVMe users will benefit from a flexible foundation which can adapt to the new NVMe requirements. The 250 series FPGA and MPSoC products provide this flexibility but also solve today’s customers’ challenges and give them an immediate competitive advantage.

Why FPGAs?

Nallatech’s FPGA and MPSoC products feature the very latest Xilinx UltraScale+ technology and fit the needs of a datacenter increasingly focused on NVMe. FPGAs have provided programmable hardware solutions to multiple industries over three decades and are broadly used to solve computing and embedded systems problems in the automotive, broadcasting, medical and military markets amongst others. At the same time, in recent years, FPGA manufacturers have brought the latest advances in integrated systems design to this proven technology.

The Xilinx UltraScale+ FPGA and MPSoC products use a 16nm process and improve system performance by providing high-speed fabric, embedded RAM, clocking, and DSP processing. In addition, Xilinx devices have introduced faster transceiver technology (up to 32.75 Gb/s) for higher-throughput connectivity into the network or the PCIe fabric. With their high count of serial transceiver channels, UltraScale+ products can connect to multiple PCIe interfaces at once and provide a data offload interface to a host CPU. In some cases, by replacing a PLX switch with an FPGA or MPSoC, the CPU can offload some of its processing and free up cycles for other operations. The programmable logic of FPGAs and MPSoCs also provides a deterministic, low-latency interface, which can give a clear competitive advantage in some use cases.

Recent FPGA families now also include embedded low-power microprocessors inside the device package. The UltraScale+ MPSoCs match the needs of applications that require software as well as programmable logic by combining them into a single package. For example, the Xilinx Zynq UltraScale+ ZU19EG features two processing units, a quad-core ARM Cortex-A53 and a real-time dual-core ARM Cortex-R5, in addition to a graphics processing unit, an ARM Mali™-400 MP2, for applications with hybrid computing needs. The ZU19EG MPSoC is a very versatile device, especially well-suited for NVMe over Fabric or Open Channel implementations where the programmable logic provides a low-latency deterministic path for the storage data, and the ARM cores perform complex packet control operations or replace a host CPU in a CPU-less embedded system.

Over the last few years, Nallatech has remained at the forefront of the storage industry and contributed to its innovative growth by developing products based on NVMe technology. Nallatech recognized that FPGAs could reduce I/O bottlenecks and offer a direct high-speed deterministic path to NVMe solid state drives. As early as 2015, Nallatech partnered with Xilinx and IBM to develop an innovative NoSQL database solution [12]. The 250 series FPGA & MPSoC boards build upon the success of this initial product and add features like deeper and faster onboard memory, network connectivity, system-on-chip processing and cabling options to server storage backplanes.

250 FPGA & MPSoC Product Series

The 250 FPGA & MPSoC product line comprises three FPGA adapters, the 250S+, 250-U2 and 250-SoC, which connect to a variety of industry-standard form factors like PCIe slots, OCuLink/Nano-Pitch, SlimSAS, MiniSAS HD, U.2 storage backplanes and more. The 250 series products fit right into an existing infrastructure’s PCIe fabric for direct low-latency access to the NVMe storage devices.

250S+ Directly Attached Accelerator

The first accelerator of the series is the 250S+. This FPGA accelerator features a Xilinx UltraScale+ Kintex KU15P FPGA and four onboard four-lane 1TB M.2 NVMe drives (4TB of non-volatile flash total) in a low-profile, 8-lane, half-height half-length PCIe-compliant form factor. Alternatively, for customers who only want to introduce FPGA computing into their system and already have storage available, the M.2 onboard connectors can cable out to OCuLink/Nano-Pitch or MiniSAS HD NVMe backplanes using Molex low-loss high-speed cabling technology. With 1,143K system logic cells, 1,968 DSP slices and 70.6 Mb of embedded memory, the KU15P is the largest device of the UltraScale+ Kintex FPGA series and provides a significant amount of configurable resources to implement value-add features. The onboard DDR4 memory bank allows for additional buffering of larger data sets.
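One implication of this topology is worth noting: four 4-lane drives can, in theory, source roughly twice the bandwidth that the 8-lane host link can carry, which is exactly where on-card FPGA processing (RAID, compression, filtering) earns its keep. A rough estimate, assuming theoretical PCIe 3.0 rates:

```python
# Rough oversubscription estimate for the 250S+ topology described above:
# four 4-lane NVMe drives behind an 8-lane PCIe 3.0 host interface.
LANE_GBS = 8 * (128 / 130) / 8      # usable GB/s per PCIe 3.0 lane ≈ 0.985

drives, drive_lanes, host_lanes = 4, 4, 8
aggregate = drives * drive_lanes * LANE_GBS   # ≈ 15.8 GB/s from the drives
host = host_lanes * LANE_GBS                  # ≈ 7.9 GB/s to the host
print(f"drives: {aggregate:.1f} GB/s, host: {host:.1f} GB/s, "
      f"ratio {aggregate / host:.1f}:1")
```

Real sustained drive rates are lower than the link ceiling, but the 2:1 ratio illustrates why reducing or transforming data on the card, before it crosses the host link, can raise effective system throughput.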

The Nallatech 250S+ is available in two configurations:
– Up to four M.2 NVMe SSDs coupled on-card to the Xilinx FPGA
– OCuLink break-out cabling allowing the 250S+ to be part of a massively scaled storage array

This compact, high-density storage node provides an all-in-one solution for applications where the host needs to read or write data to NVMe drives at high speed. The onboard FPGA can efficiently orchestrate and process the streams of data to/from the storage, presenting the drives as one or multiple namespaces or implementing RAID functionality. The 250S+ can be used as a Directly Attached Accelerator (DAA) to virtualize storage, allowing NVMe SSDs to be shared by multiple Virtual Machines while providing a layer of isolation and security between the host CPU and the NVMe SSDs. The FPGA’s programmable logic also provides the option to packetize, compress or encrypt data inline with only a minor impact on drive access bandwidth and latency; for example, Xilinx’s erasure coding IP introduces a negligible 90ns of latency – far superior in raw performance to a CPU-based implementation. The 250S+ also addresses checkpoint-restart and burst-buffer caching use cases, providing an easy caching solution for virtualized and standalone AI and IoT environments.
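The erasure-coding idea mentioned above can be illustrated with the simplest possible code: a single XOR parity block computed across the four drives’ data. This toy sketch tolerates one lost block; production IP such as Xilinx’s typically implements stronger, Reed-Solomon-style codes that survive multiple failures.

```python
# Minimal single-parity erasure code across four data blocks, in the
# spirit of the RAID-style protection mentioned above (greatly simplified
# relative to a hardware erasure-coding IP).
def parity(blocks):
    """XOR together equal-length byte blocks."""
    p = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            p[i] ^= byte
    return bytes(p)

data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]   # one block per M.2 drive
p = parity(data)

# Drive 2 fails: recover its block by XOR-ing the survivors with parity.
survivors = [data[0], data[1], data[3]]
recovered = parity(survivors + [p])
assert recovered == data[2]
print("recovered:", recovered)   # b'CCCC'
```

In hardware, this XOR tree is a few levels of logic in the data path, which is why an FPGA implementation can add only tens of nanoseconds of latency while a CPU must move the data through caches and execute the same work in software.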

Directly Attached Accelerator (DAA)
• Virtualize the NVMe storage and share across multiple Virtual Machines
• Isolate the NVMe storage to increase security between the host CPU and the NVMe SSDs
• 250S+ & 250-SoC

250-U2 Proxy In-Line Accelerator

The second member of the 250 series is the 250-U2. This accelerator board features a Xilinx UltraScale+ Kintex KU15P FPGA (the same device as the 250S+) and one bank of DDR4 memory in a 2.5” U.2 drive form factor. Unlike the 250S+, the 250-U2 does not have any onboard SSDs directly attached to the FPGA. The novel design of this accelerator allows it to fit into existing U.2 storage backplanes in systems with no dedicated PCIe slots, placing additional compute power next to existing standard U.2 NVMe storage. The 250-U2 thus takes on the role of Proxy In-Line Accelerator (PIA).

The 250-U2 can perform inline compression, encryption and hashing, but also more complex functions such as erasure coding, deduplication, string/image search or database sort/join/filter. Depending on the computing needs of an application, the backplane can be populated with varying ratios of 250-U2 boards to NVMe drives. The 250-U2 sits in the U.2 backplane alongside the storage and offers the same maintenance options as any other standard U.2 NVMe drive by leveraging the NVMe-MI specification. As both the 250-U2 processing node and the storage connect directly to the PCIe fabric of the host server, DMA data traffic can bypass the CPU and global memory entirely for optimized end-point to end-point data transfers using technology like SPDK. With RDMA or peer-to-peer DMA solutions, the data flows directly between NVMe end-points, bypassing the CPU entirely. These direct interfaces into the FPGA and MPSoC programmable logic significantly reduce access latency [21]. Alternatively, this hardware platform can serve as an offload compute engine, and it would fit nicely into an FPGA-as-a-Service scalable infrastructure.

Proxy In-Line Accelerator (PIA)
• Perform low-latency, high-bandwidth processing on local NVMe storage data
• Multiple host form factors: 8-lane PCIe adapter or 2.5” U.2
• 250S+ & 250-U2

250-SoC for NVMe-over-Fabric

The third accelerator of the series, the 250-SoC, features a Xilinx UltraScale+ Zynq ZU19EG MPSoC and can connect to both the network fabric, through two QSFP28 ports (25Gbps line rates for 100GbE support), and the PCIe fabric, through a 16-lane PCIe 3.0 host interface and four 8-lane OCuLink connectors. The ZU19EG is the largest device in its series with 1,143K system logic cells, 1,968 DSP slices and 70.6 Mb of embedded memory. The embedded ARM processing and graphics units in the device package create the ideal platform for a product with hybrid processing requirements.

The 250-SoC’s hardware versatility allows for direct access to storage from the network and supports NVMe-over-Fabric. NVMe-oF is the next-generation NVMe protocol for disaggregating storage over the network fabric and managing storage remotely; NVMe-oF also provides additional flexibility over SAS to set up a network array on demand. Disaggregated storage or EJBOF (Ethernet Just-a-Bunch-Of-Flash) hardware reduces storage cost, footprint and power in the datacenter.

The Xilinx Zynq MPSoC offers additional flexibility for embedded systems. The board can run an operating system and its full software stack independently from a host CPU. With its high-bandwidth network features supporting up to two 100GbE ports and the onboard MPSoC, the 250-SoC removes the need for both an external Network Interface Card (NIC) and an external processor in NVMe-oF applications [13]. An FPGA-based NVMe-oF implementation is simple and performant because the data only flows through hardware paths, which yields low and predictable latency.

NVMe over Fabric (NVMeoF) Block Diagram

NVMe-over-Fabric (NVMe-oF)
• Low-Latency and High-Throughput of NVMe frames over the datacenter network fabric
• 250-SoC

Together, the 250 series provides a flexible array of solutions for the storage industry. The 250S+ and the 250-SoC tackle the need for virtualization and increased security by targeting the Directly Attached Accelerator use case. The 250-U2 and the 250S+ easily plug in to an existing infrastructure as Proxy In-Line Accelerators to offer low-latency, high-bandwidth local data compute for the NVMe storage. And finally, the 250-SoC supports NVMe-over-Fabric as an innovative hardware-only method to disaggregate storage while supporting the latest-generation NVMe protocols. As the NVMe market continues to grow, FPGA and MPSoC solutions will solve the application challenges of NVMe products.

NVMe Applications

NVMe technology has brought disruptive innovation to storage and has a far-reaching impact on the datacenter infrastructure. The features of the protocol make NVMe the number one choice when designing a new product or application involving storage.

Enterprise applications such as database acceleration require low latency as well as high-bandwidth 4K or 8K data write transfer rates – two requirements that fit perfectly with the NVMe protocol’s strengths. These characteristics make NVMe a leading choice for implementing a redo log, for example, a use case where many transaction records are stored for future replay if the database fails. For this use case, the 250S+ brings up to 4TB of NVMe storage straight to the edge of the FPGA reconfigurable fabric, where the transaction records are written to the SSDs at high speed, ready for replay [14].

NVMe also alleviates the challenges of virtualized infrastructures and simplifies the implementation of Virtual Machines (VMs), stateless VMs and SR-IOV, where IO is the most common bottleneck. In the stateless VM use case, the IT manager needs to lock down operating system images so that corporate users cannot modify them. Users only modify their data and the OS image remains unchanged in the NVMe storage; privacy and security between users is critical. In such an IT infrastructure, NVMe storage is shared between multiple users. The 250S+ is an all-in-one platform for implementing this application: each 1TB physical drive is divided by the FPGA IP so that each user gets segregated and secure access to their OS image and data. The hypervisor manages direct access to each fraction of the drive without the need for an emulation driver, which provides better performance for this IO-bound application.
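A hypothetical sketch of the drive partitioning just described: FPGA IP could present each user’s slice of a shared 1TB drive as its own namespace by bounds-checking and rebasing logical block addresses. All names and sizes below are illustrative, not Nallatech’s actual IP.

```python
# Illustrative LBA translation for carving one shared 1 TB NVMe drive
# into per-user slices, as in the stateless-VM use case above.
# (Hypothetical sketch; sizes and the helper name are invented.)
DRIVE_BLOCKS = 1_000_000_000_000 // 4096   # 1 TB drive, 4 KiB blocks
USERS = 4
slice_blocks = DRIVE_BLOCKS // USERS       # blocks per user's namespace

def translate(user_id: int, lba: int) -> int:
    """Map a user-visible LBA to a physical LBA on the shared drive,
    rejecting accesses outside the user's own slice."""
    if not 0 <= lba < slice_blocks:
        raise ValueError("LBA outside this user's namespace")
    return user_id * slice_blocks + lba

print(translate(0, 0))     # first user's slice starts at physical LBA 0
print(translate(2, 100))   # third user's slice, offset 100
```

The key security property is that the range check happens in hardware on every access, so no user (or compromised VM) can address another user’s OS image or data, with no emulation driver in the path.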

The “Big Data” market also brings opportunities for intelligent NVMe products which combine storage and processing, since it is moving away from a batch approach to a real-time processing methodology. MapReduce problems are moving towards real-time analytics instead of batching and therefore need a new tier of storage which is much faster than the GFS backend. The storage tiering now seen in IT infrastructures ranges from cold storage, rarely accessed and low speed, to very fast SSDs, NVMe drives or NVM memories. In this use case, all the data is recorded in the GFS but then moved to a compute node with faster memory. The 250-SoC implementing NVMe-over-Fabric answers both of these requirements as it gives access to high-speed storage and high-performance compute capabilities.

The deep learning industry has similar needs to the analytics world. The new generation of deep learning accelerators (GPGPUs, TPUs and FPGAs) needs large memory bandwidth to match the chips’ compute abilities. Training operations consume a lot of this high-throughput data, often multiple terabytes [15]. Recent research efforts show that the FPGA fabric can accelerate training operations for certain network types. Therefore, combining both the storage and the compute engine on one hardware platform reduces latency, allowing for more retraining cycles as the training dataset grows [16].

In the HPC space, the local storage of the 250S+ and the remote variant with the 250-SoC have several applications, such as checkpoint/restart, burst buffer, distributed filesystems or caching job data from a scheduler. By running the algorithm close to the storage on the FPGA fabric, the footprint of the FPGA application remains low while utilizing the storage fully and keeping the CPU free for other processing jobs. Consider in-memory databases, in which gigabytes of data are held in volatile memory but need to be backed up to flash on a regular basis: instead of simply storing the data or using the host CPU to compress or encrypt it, an FPGA-based system can process these snapshots for permanent storage in large NVMe-based storage arrays. For this type of operation, the MPSoC is particularly well-suited to performing more complex operations on the user data.

Finally, in the IoT space, there is a need for data filtering and preprocessing on IoT gateways, where aggregation takes place and received data must be encrypted. The FPGA processes streams of data in real time with bit-level operations like encryption or compression, then either stores the data on-board using the 250S+ or passes it to the storage backplane at the input bandwidth with the cabled 250S+ or the 250-SoC. FPGAs are also the platform of choice for blockchain calculations; blockchain technology differentiates IoT gateways by providing an adaptive and secure method to maintain the user privacy preferences of IoT devices [17].

Nallatech’s Capabilities

For over twenty years, Nallatech has helped industry specialists introduce FPGAs in their infrastructure to design, develop and optimize workloads. During that time, Nallatech Compute and Network solutions have provided a competitive advantage to customers in various industries including HPC, Finance, Genomics and Embedded Computing. Nallatech combines hardware, software, and system design expertise to guide customers looking to maximize the benefits of FPGA technologies in their products.

For the 250 accelerator series, Nallatech has selected a variety of Xilinx UltraScale+ devices and PCIe form factors to offer a complete solution to storage infrastructure architects. These accelerators connect the programmable logic of the Xilinx devices directly into the infrastructure network and PCIe fabric through latest-generation 100GbE and PCIe 3.0 high-speed interfaces. Additionally, drawing on the capabilities of Nallatech’s parent company Molex, an industry leader in ultra-high-speed low-loss cables and interconnect solutions, the 250 series offers high flexibility in connecting to existing hardware.

Conclusion
NVMe has transformed, and is still transforming, the storage industry at a rapid pace. This new high-throughput storage technology provides a flexible storage solution for IT infrastructures. NVMe not only provides superior data write and read bandwidth compared to previous-generation storage, it also leverages the existing PCIe and network fabric of current datacenters. As NVMe becomes more popular, industry innovators are launching new products which support it; all of the basic datacenter equipment is being updated for NVMe, and NVMe storage backplanes are now the new norm.

FPGA-based products for NVMe allow compute to merge with storage at the hardware level to reach higher application performance. With FPGAs, the reconfigurable logic is directly attached to the storage through a high-throughput, low-latency pipe. Because of these characteristics, data can flow through the FPGA and be processed in real time. Additionally, by using FPGA processing, the CPU cores become free to perform tasks that can only run on the processor. With MPSoCs, additional capabilities become available: the device combines high-speed data processing and control and can potentially run autonomously.

Nallatech FPGA and MPSoC-based storage products have been designed to target the needs of real applications and solve the challenges of IT infrastructure managers. Nallatech offers a path to production with the 250 product series. For more information, please visit www.nallatech.com/storage.


1. McDowell S. (2018). Storage Industry 2018: Predictions For The Year To Come. Forbes. Retrieved June 4, 2018, from: https://www.forbes.com/sites/moorinsights/2018/01/24/storage-industry-2018-predictions-for-the-year-to-come
2. Ahmad M. (2017). Four trends to watch in NVMe-based storage designs. Electronic Designs. Retrieved June 8, 2018, from: https://www.electronicproducts.com/Computer_Peripherals/Storage/Four_trends_to_watch_in_NVMe_based_storage_designs.aspx
3. G2M Research (2018). G2M Research NVMe Ecosystem Market Sizing Report. G2M Research. Retrieved June 6, 2018, from: http://g2minc.com/g2m-research-nvme-ecosystem-market-sizing-report
4. Mehta N. (2015). Pushing Performance and Integration with the UltraScale+ Portfolio. Xilinx. Retrieved June 8, 2018, from: https://www.xilinx.com/support/documentation/white_papers/wp471-ultrascale-plus-perf.pdf
5. Allen D., & Metz J. (2018a). The Evolution and Future of NVMe. Bright Talk. Retrieved from: https://www.brighttalk.com/webcast/12367/290529
6. Nuncic (2017). More Speed for your SSD – NVME Expected to Replace SATA and SAS in the Future. OnTrack. Retrieved June 8, 2018, from: https://www.ontrack.com/blog/2017/09/15/nvme-replace-sata-sas/
7. Adshead A. (2017). Storage briefing: NVMe vs SATA and SAS. Computer Weekly. Retrieved June 8, 2018, from: https://www.computerweekly.com/feature/Storage-briefing-NVMe-vs-SATA-and-SAS
8. Rollins D. (2017). The Business Case for NVMe PCIe SSDs. Micron website. Retrieved from: https://www.micron.com/about/blogs/2017/july/the-business-case-for-nvme-pcie-ssds
9. Allen D., & Metz J. (2018b). On the Horizon for NVMe Technology: Q&A on the Evolution and Future of NVMe Webcast. NVM Express. Retrieved from: https://nvmexpress.org/on-the-horizon-for-nvme-technology-qa-on-the-evolution-and-future-of-nvme-webcast/
10. Maharan P. (2018). A Review of NVMe Optional Features for Cloud SSD Customization. Seagate Blog. Retrieved from: https://blog.seagate.com/intelligent/a-review-of-nvme-optional-features-for-cloud-ssd-customization/
11. Martin B. (2017). I/O Determinism and Its Impact on Datacenters and Hyperscale Applications. Flash Memory Summit 2017. Retrieved from: https://www.flashmemorysummit.com/English/Collaterals/Proceedings/2017/20170808_FB11_Martin.pdf
12. Leibson S. (2016). IBM and Nallatech demo CAPI Flash at OpenPOWER Summit in San Jose. Xcell Daily Blog. Retrieved June 4, 2018, from: https://forums.xilinx.com/t5/Xcell-Daily-Blog/IBM-and-Nallatech-demo-CAPI-Flash-at-OpenPOWER-Summit-in-San/ba-p/691256
13. Sakalley D. (2017). Using FPGAs to accelerate NVMe-oF based Storage Networks. Flash Memory Summit. Retrieved June 7, 2018, from: https://www.flashmemorysummit.com/English/Collaterals/Proceedings/2017/20170810_FW32_Sakalley.pdf
14. Rollins J. D. (n.d.). Redo Log Files and Backups. Wake Forest University. Retrieved from: http://users.wfu.edu/rollins/oracle/archive.html
15. Wahl M., Hartl D., Lee W., Zhu X., Menezes E., & Tok W. H. (2018). How to Use FPGAs for Deep Learning Inference to Perform Land Cover Mapping on Terabytes of Aerial Images. Microsoft Blog.
16. Teich D. (2018). Management AI: GPU and FPGA, Why They Are Important for Artificial Intelligence. Forbes. Retrieved from: https://www.forbes.com/sites/davidteich/2018/06/15/management-ai-gpu-and-fpga-why-they-are-important-for-artificial-intelligence/#6bf2ff171599
17. Cha S. C., Chen J. F., Su C., & Yeh K. H. (2018). Blockchain Connected Gateway for BLE-Based Devices in the Internet of Things. IEEE Access. Retrieved from: https://ieeexplore.ieee.org/document/8274964/
18. Alcorn (2017). Hot Chips 2017: We’ll See PCIe 4.0 This Year, PCIe 5.0 In 2019. Tom’s Hardware. Retrieved June 8, 2018, from: https://www.tomshardware.com/news/pcie-4.0-5.0-pci-sig-specfication,35325.html
19. Caulfield L. (2018). Project Denali to define flexible SSDs for cloud-scale applications. Azure Microsoft. Retrieved June 6, 2018, from: https://azure.microsoft.com/en-us/blog/project-denali-to-define-flexible-ssds-for-cloud-scale-applications/
20. Ismail N. (2017). Flash storage: transforming the storage industry. Information Age. Retrieved June 4, 2018, from: http://www.information-age.com/flash-storage-transforming-storage-industry-123465174/
21. Lusinsky R. (2017). 11 Myths about RDMA over Converged Ethernet (RoCE). Electronic Design. Retrieved June 9, 2018, from: http://www.electronicdesign.com/industrial-automation/11-myths-about-rdma-over-converged-ethernet-roce
22. Miller R. (2017). IBM’s new Power9 chip was built for AI and machine learning. Tech Crunch. Retrieved June 8, 2018, from: https://techcrunch.com/2017/12/05/ibms-new-power9-chip-architected-for-ai-and-machine-learning/
23. Peng V. (2015). 16nm UltraScale+ Series by Victor Peng, EVP & GM. Xilinx. Retrieved June 8, 2018, from: https://www.xilinx.com/video/fpga/16nm-ultrascale-plus-series.html
24. Vaid K. (2018). Microsoft creates industry standards for datacenter hardware storage and security. Azure Blog. Retrieved from: https://azure.microsoft.com/en-us/blog/microsoft-creates-industry-standards-for-datacenter-hardware-storage-and-security/
25. Retrieved from: https://blogs.technet.microsoft.com/machinelearning/2018/05/29/how-to-use-fpgas-for-deep-learning-inference-to-perform-land-cover-mapping-on-terabytes-of-aerial-images/

Published: 2018-08-30

FPGA Acceleration of Binary Neural Networks


Deep Learning

Until only a decade ago, Artificial Intelligence resided almost exclusively within the realm of academia, research institutes and science fiction. The relatively recent realization that Deep Learning techniques could be applied practically and economically, at scale, to solve real-world application problems has resulted in a vibrant eco-system of market players.

Now, almost every application area is in some way benefiting from Deep Learning – the leveraging of Artificial Neural Networks to learn from vast volumes of data to efficiently execute specific functions. From this field of neural network research and innovation, Convolutional Neural Networks (CNNs) have emerged as a popular deep learning technique for solving image classification and object recognition problems. CNNs exploit spatial correlations within the image sets by using convolution operations. CNNs are generally regarded as the neural network of choice, especially for low-power applications, because they have fewer weights and are easier to train than fully connected networks, which demand more resources.

Neural Networks

One approach to reducing the silicon count, and therefore the power, required to execute a high-performance neural network is to reduce the dynamic range of its arithmetic. Using 16-bit floating-point arithmetic instead of 32-bit has been shown to have only a slight impact on the accuracy of image classification. Furthermore, depending upon the network, the precision of the calculation can be reduced even further, to fixed point or even single bits. This trend of improving overall efficiency through reduced calculation precision has led to the use of binary weights, i.e. weights and input activations binarized to only two values: +1 and -1. This variant is known as a Binary Neural Network (BNN). It reduces the fixed-point multiplication operations in the convolutional layers and fully connected layers to 1-bit XNOR operations.

Flexible FPGAs

Established classes of conventional computing technologies have attempted to evolve at pace to cater for this dynamic market. NVIDIA, for instance, has not only adapted the underlying GPU architecture and tools, but also their product strategy and value proposition. GP-GPUs, previously marketed as the ultimate double precision floating-point engines for graphics and demanding HPC applications are now being re-positioned for the Deep Learning CNN market where half-precision arithmetic support is critical for success.

Google, one of the strongest proponents of AI, has created its own dedicated hardware architecture, the Tensor Processing Unit (TPU), which is tightly coupled with their Machine Learning framework, TensorFlow. Other industry leaders, including hyperscale innovator Microsoft, have selected Field Programmable Gate Arrays (FPGAs) for their “Brainwave” AI architecture – a pipeline of persistent neural networks that promises to deliver real-time results. This choice is no doubt linked to the confidence they gained from the highly successful (and market disrupting) use of Intel-based Arria-10 FPGAs for Bing search indexing.

This white paper explains why FPGAs are uniquely positioned to address the dynamic roadmap requirements of neural networks of all bit ranges – in particular, BNNs.

Binary Neural Networks

Processing convolutions within CNN networks requires many millions of coefficients to be stored and processed. Traditionally, each of these coefficients is stored in a full single-precision representation. Research has demonstrated that coefficients can be reduced to half precision without any material change to overall accuracy, while reducing storage capacity and memory bandwidth. More significantly, this approach also shortens training and inference time. Most of the pre-trained CNN models available today use partially reduced precision.

Figure 1 : Converting weights to binary (mean = 0.12)

By using a different approach to the training of these coefficients, the bit accuracy can be reduced to a single bit plus a scaling factor1. During training, the floating-point coefficients are binarized: the coefficients of each output feature are averaged, the average is subtracted from each original value, and the sign of the result (positive or negative) is stored as a 1 or 0 (Figure 1). The output of the convolution is then multiplied by the mean to restore the scale.
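As a rough illustration of this scheme, the following C sketch (not Nallatech's implementation; the function name and layout are ours) binarizes one filter's coefficients around their mean and returns the mean as the scaling factor:

```c
#include <stddef.h>

/* Binarize a filter: compute the mean of its coefficients, store the sign of
   (w - mean) as a single bit (positive -> 1, negative -> 0), and return the
   mean as the scaling factor applied to the convolution output.
   Illustrative sketch of the scheme described above, not production code. */
float binarize_weights(const float *w, size_t n, unsigned char *bits) {
    float mean = 0.0f;
    for (size_t i = 0; i < n; ++i)
        mean += w[i];
    mean /= (float)n;                          /* the scaling factor */
    for (size_t i = 0; i < n; ++i)
        bits[i] = (w[i] - mean > 0.0f) ? 1 : 0; /* sign stored in one bit */
    return mean;
}
```

For example, the weights {0.5, -0.3, 0.2, 0.08} give a mean of 0.12 and the bit pattern 1, 0, 1, 0 — a 32x reduction in weight storage.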

FPGA Optimizations

Firstly, binarization of the weights reduces the external memory bandwidth and storage requirements by a factor of 32. The FPGA fabric can take advantage of this binarization as each internal memory block can be configured to have a port width ranging from 1 to 32 bits. Hence, the internal FPGA resource for storage of weights is significantly reduced, providing more space for parallelization of tasks.

The binarization of the network also allows the CNN convolutions to be represented as a series of additions or subtractions of the input activations: if the weight is binary 0, the input is subtracted from the result; if the weight is binary 1, it is added. Each logic element in an FPGA has addition carry chain logic that can efficiently perform integer additions of virtually any bit length. Utilizing these components efficiently allows a single FPGA device to perform tens of thousands of parallel additions. To do so, the floating-point input activations must be converted to fixed precision. Given the flexibility of the FPGA fabric, we can tune the number of bits used by the fixed additions to meet the requirements of the CNN. Analysis of the dynamic range of activations in various CNNs shows that only a handful of bits, typically 8, are required to maintain accuracy to within 1% of a floating-point equivalent design. The number of bits can be increased if more accuracy is required.
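A minimal C sketch of this add/subtract accumulation (a software model of what the FPGA carry chains perform in parallel; the function name is illustrative):

```c
#include <stddef.h>

/* Binary-weight dot product: a weight bit of 1 adds the activation to the
   accumulator, a bit of 0 subtracts it. Activations are 8-bit fixed point,
   accumulated in a wider integer, mirroring the FPGA carry-chain adders. */
int binary_dot(const unsigned char *wbits, const signed char *act, size_t n) {
    int acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc += wbits[i] ? act[i] : -act[i];  /* no multiplies required */
    return acc;
}
```

On the FPGA each such add/subtract maps to carry-chain logic rather than a DSP multiplier, which is where the resource saving comes from.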

Converting to fixed point for the convolution and removing the need for multiplications via binarization dramatically reduces the logic resources required within the FPGA. It is then possible to perform significantly more processing in the same FPGA than with a single-precision or half-precision implementation.

Deep Learning models are becoming deeper by adding more and more convolution layers. Having the capability to stack all these layers into a single FPGA device is critical to achieving the best performance per watt for a given cost while retaining the lowest possible latency.

FPGA Implementation

The Intel FPGA OpenCL framework was used to create the CNNs described in this paper. To optimize the design further, the Nallatech research center developed IP libraries for the binary convolution and other bit manipulation operations. This provides a powerful mix of programmability and efficiency.

Table 1 : Approximate Yolo V3 layers

The network targeted for this white paper was the Yolo v3 network (Table 1). This network consists largely of convolution layers and therefore the FPGA has been optimized to be as efficient at convolutions as possible.

To achieve this, the design uses a HDL block of code to perform the integer accumulations required for binary networks, making for an extremely efficient implementation.

Table 2 : Resource requirements of BNN IP (% Arria 10 GX 1150)

Table 2 lists the resource requirements for accumulating the 8-bit activation data when using binary weights. This is equivalent to 2048 floating-point operations, yet requires only 2% of the device. Note that extra resource is needed to restructure the data so it can be processed this way (see Table 3), but the figures illustrate the dramatic reduction in resources achievable versus a floating-point implementation.

The FPGA is also required to process the other layers of Yolo v3 to minimize the data copied over the PCIe interface. These layers require much less processing and therefore less of the FPGA resource is allocated to these tasks. In order for the network to train correctly, it was necessary for activation layers to be processed with single precision accuracy. Therefore, all layers other than the convolution are calculated at single precision accuracy.

The final convolution layer is also calculated in single precision to improve training and is processed on the host CPU. Table 3 details the resources required by the OpenCL kernels including all conversions from float to 8-bit inputs, the scaling of the output data and final floating-point accumulation.

Table 3 : Resource requirements for full Yolo v3 CNN kernel (% Arria 10 GX 1150)

FPGA Accelerator Platforms

The FPGA device targeted in this whitepaper is an Intel-based Arria-10. It is a mid-range FPGA fully supported within the Intel OpenCL Software Development Kit (SDK). Nallatech delivers this flexible, energy-efficient accelerator in the form of either an add-in PCIe card or integrated rackmount server. Applications developed in OpenCL are mapped onto the FPGA fabric using Nallatech’s Board Support Package (BSP) enabling customers (predominantly software rather than hardware focused) to remain at a higher level of abstraction than is typically the case with FPGA technology.

Nallatech’s flagship “520” accelerator card shown below features Intel’s new Stratix-10 FPGA. It is a PCIe add-in card compatible with server platforms supporting GPU-class accelerators, making it ideal for scaling Deep Learning platforms cost-effectively.


Each convolution block performs 2048 operations per clock cycle, or ~0.5 TOPS for a typical Arria 10 device. Four such kernels allow Yolo v3 to run at a frame rate of ~8 frames per second at a power consumption of 35 Watts. This is equivalent to 57 GOPS/Watt.

XNOR Networks

It is possible to further reduce compute and storage requirements of CNNs by moving to a full XNOR network. Here both the weights and activations are represented as binary inputs. In this case a convolution is represented as a simple bitwise XNOR calculation, plus some bit counting logic. This is equivalent to the binary version described earlier except that activations are now only a single bit wide.
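The bitwise form can be modelled in a few lines of C. This is a software sketch of the XNOR-plus-popcount idea, not an FPGA implementation, and the helper names are ours:

```c
#include <stdint.h>

/* Count set bits in a 32-bit word (a hardware popcount would be used on FPGA). */
static int popcount32(uint32_t x) {
    int c = 0;
    while (x) { c += (int)(x & 1u); x >>= 1; }
    return c;
}

/* XNOR dot product over up to 32 binary lanes: weights and activations are
   packed one bit per lane. XNOR marks lanes where the two bits agree; mapping
   matches/mismatches to +1/-1 gives 2*matches - nbits. */
int xnor_dot(uint32_t w, uint32_t a, int nbits) {
    uint32_t xnor = ~(w ^ a);
    if (nbits < 32) xnor &= (1u << nbits) - 1u;  /* mask off unused lanes */
    int matches = popcount32(xnor);
    return 2 * matches - nbits;
}
```

A single 32-bit XNOR plus popcount thus replaces 32 multiply-accumulates, which is the source of the efficiency gain described below.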

The speed-up of such networks is estimated at two orders of magnitude when running on an FPGA. This disruptive performance improvement enables multiple real-time inferences to run in parallel on power-efficient devices. XNOR networks require a different approach to training, where activations on the forward pass are converted to binary values plus a scaling factor.

Whereas binary networks show little degradation in accuracy, XNOR networks show a 10-20%2 difference relative to a floating-point equivalent. However, this is measured using CNNs not designed specifically for XNOR calculations. As research into this area increases, the industry is likely to see new models designed with XNOR networks in mind that provide accuracy close to the best CNNs, while benefiting from the tremendous efficiency of this new approach.


This whitepaper has demonstrated that significant bit reductions can be achieved without adversely impacting the quality of application results. Binary Neural Networks (BNNs), a natural fit for the properties of the FPGA, can be up to thirty times smaller than classic CNNs – delivering a range of benefits including reductions in silicon usage, memory bandwidth, power consumption and clock speed.

Given their recognized strength for efficiently implementing fixed point computations, FPGAs are uniquely positioned to address the needs of BNNs. The inherent architecture flexibility of the FPGA empowers Deep Learning innovators and offers a fast-track deployment option for any new disruptive techniques that emerge. XNOR networks are predicted to deliver major improvements in image recognition for a range of cloud, edge and embedded applications.

Nallatech, a Molex company, has over 25 years of FPGA expertise and is recognized as a market leader in FPGA platforms and tools. Nallatech’s complementary design services allow customers to successfully port, optimize, benchmark and deploy FPGA-based Deep Learning solutions cost-effectively and with minimal risk.

Please visit www.nallatech.com or email contact@nallatech.com for further information.

This work has been partly developed as part of the OPERA project to provide offloading support for low powered traffic monitoring systems: www.operaproject.eu


FPGA Acceleration of Binary Neural Networks (published 2018-06-25)

FPGA Acceleration of Convolutional Neural Networks


Introduction – CNN – Convolutional Neural Network

Convolutional Neural Networks (CNNs) have been shown to be extremely effective at complex image recognition problems. This white paper discusses how these networks can be accelerated using FPGA accelerator products from Nallatech, programmed with the Altera OpenCL Software Development Kit. It then describes how image categorization performance can be significantly improved by reducing computation precision: each reduction in precision allows the FPGA accelerator to process more images per second.

Caffe Integration

Caffe is a deep learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center and by community contributors.

The Caffe framework uses a protobuf-based text interface (prototxt) to describe the different processing layers required for a particular CNN. By implementing different combinations of layers, a user can quickly create a new network topology for a given requirement.

The most commonly used of these layers are:
• Convolution: The convolution layer convolves the input image with a set of learnable filters, each producing one feature map in the output image.
• Pooling: Max-pooling partitions the input image into a set of non-overlapping rectangles and, for each such sub-region, outputs the maximum value.
• Rectified-Linear: Given an input value x, the ReLU layer computes the output as x if x > 0 and negative_slope * x if x <= 0.
• InnerProduct/Fully Connected: The image is treated as a single vector, with each point contributing to each point of the new output vector.

By porting these 4 layers to the FPGA, the vast majority of forward processing networks can be implemented on the FPGA using the Caffe framework.
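As a plain-C sketch of two of these layers (illustrative only; Caffe's actual GPU/FPGA kernels are more general), leaky ReLU and 2×2 non-overlapping max pooling can be written as:

```c
/* Rectified-Linear layer for one value: x if x > 0, else negative_slope * x. */
float relu(float x, float negative_slope) {
    return x > 0.0f ? x : negative_slope * x;
}

/* 2x2 non-overlapping max pooling on a w x h single-channel image (w, h even):
   each output pixel is the maximum of one 2x2 input rectangle. */
void maxpool2x2(const float *in, int w, int h, float *out) {
    for (int y = 0; y < h; y += 2)
        for (int x = 0; x < w; x += 2) {
            float m = in[y * w + x];
            if (in[y * w + x + 1]       > m) m = in[y * w + x + 1];
            if (in[(y + 1) * w + x]     > m) m = in[(y + 1) * w + x];
            if (in[(y + 1) * w + x + 1] > m) m = in[(y + 1) * w + x + 1];
            out[(y / 2) * (w / 2) + x / 2] = m;
        }
}
```

Both layers are embarrassingly parallel, which is why they map cleanly onto FPGA pipelines alongside the convolution layers.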

Figure 1 : Example illustration of a typical CNN – Convolutional Neural Network

To access the accelerated FPGA version of the code, the user need only change the layer description in the Caffe network description file to target the FPGA equivalent.


Figure 2 : AlexNet CNN – Convolutional Neural Network

AlexNet is a well-known and widely used network, with freely available trained datasets and benchmarks. This paper discusses an FPGA implementation targeting the AlexNet CNN; however, the approach applies equally well to other networks.

Figure 2 illustrates the different network layers required by the AlexNet CNN. There are 5 convolution and 3 fully connected layers; these layers occupy > 99% of the processing time for this network. The convolution layers use 3 different filter sizes: 11×11, 5×5 and 3×3. Creating separately optimized logic for each convolution layer would be inefficient, because the computational time of each layer differs with the number of filters applied, the size of the input images, and the number of input and output features processed. Instead, by increasing the resource applied to the more compute-intensive layers, each layer can be balanced to complete in the same amount of time. It is therefore possible to create a pipelined process with several images in flight at any one time, maximizing the efficiency of the logic used, i.e. most processing elements are busy most of the time.

Table 1 : ImageNet layer computation requirements

Table 1 shows the computation required for each layer of the ImageNet network. It can be seen that the 5×5 convolution layer requires more compute than the other layers; more of the FPGA's processing logic must therefore be allocated to this layer for it to balance with the others.

The inner product layers have an n-to-n mapping, requiring a unique coefficient for each multiply-add. Inner product layers usually require significantly less compute than convolution layers and therefore need less parallelization of logic. In this scenario it makes sense to move the inner product layers onto the host CPU, leaving the FPGA to focus on convolutions.

FPGA logic areas

FPGA devices have two processing regions: DSP and ALU logic. The DSP blocks are dedicated logic for multiply or multiply-add operators, because implementing large (18×18-bit) multiplications in ALU logic is costly. Given how common multiplication is in DSP workloads, FPGA vendors provide dedicated logic for the purpose. Altera has gone a step further and allows the DSP logic to be reconfigured to perform floating-point operations. To increase performance for CNN processing it is necessary to increase the number of multiplications that can be implemented in the FPGA. One approach is to decrease the bit accuracy.

Bit Accuracy

Most CNN implementations use floating-point precision for the different layer calculations. For a CPU or GPGPU implementation this is not an issue, as the floating-point IP is a fixed part of the chip architecture. For FPGAs the logic elements are not fixed. The Arria 10 and Stratix 10 devices from Altera have embedded floating-point DSP blocks that can also be used for fixed-point multiplication. Each DSP block can in fact be used as two separate 18×19-bit multipliers. By performing convolution with 18-bit fixed-point logic, the number of available operators doubles compared to single-precision floating point.

Figure 3 : Arria 10 floating point DSP configuration

If reduced-precision floating-point processing is acceptable, it is possible to use half precision. This requires additional logic from the FPGA fabric, but doubles the number of floating-point calculations possible, assuming the lower bit precision is still adequate.

One of the key advantages of the pipeline approach described in this white paper is the ability to vary accuracy at different stages of the pipeline. Resources are therefore only used where necessary, increasing the efficiency of the design.

Figure 4 : Arria 10 fixed point DSP configuration

Depending upon the application's tolerance, the bit precision can be reduced further still. If the width of the multiplications can be reduced to 10 bits or fewer (20-bit output), the multiplication can be performed efficiently using just the FPGA ALU logic. This doubles the number of multiplications possible compared to using the DSP logic alone. Some networks may be tolerant to even lower bit precision; the FPGA can handle all precisions down to a single bit if necessary.

For the CNN layers used by AlexNet, 10-bit coefficient data was found to be the smallest width obtainable with a simple fixed-point implementation while maintaining less than a 1% error versus a single-precision floating-point operation.

CNN convolution layers

Using a sliding window technique, it is possible to create convolution kernels that are extremely light on memory bandwidth.

Figure 5 : Sliding window for 3×3 convolution

Figure 5 illustrates how data is cached in FPGA memory allowing each pixel to be reused multiple times. The amount of data reuse is proportional to the size of the convolution kernel.
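A behavioural C model of a 3×3 convolution makes the reuse visible: every interior pixel is read by up to nine window positions, which is exactly the reuse the on-chip line buffers exploit so external memory is read only once per pixel. This is a software sketch, not the FPGA pipeline:

```c
/* Valid-mode 3x3 convolution over a w x h single-channel image.
   out has dimensions (w-2) x (h-2). In the FPGA version the three image rows
   under the window live in on-chip line buffers, so each pixel is fetched
   from external memory once and reused up to nine times. */
void conv3x3(const float *img, int w, int h, const float k[9], float *out) {
    for (int y = 1; y < h - 1; ++y)
        for (int x = 1; x < w - 1; ++x) {
            float s = 0.0f;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    s += img[(y + dy) * w + (x + dx)] * k[(dy + 1) * 3 + (dx + 1)];
            out[(y - 1) * (w - 2) + (x - 1)] = s;
        }
}
```
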

As each input layer influences all output layers in a CNN convolution layer, it is possible to process multiple input layers simultaneously. This would increase the external memory bandwidth required for loading layers; to mitigate the increase, all data except coefficients is stored in local M20K memory on the FPGA device. The amount of on-chip memory on the device limits the number of CNN layers that can be implemented.

Figure 6 : OpenCL Global Memory Bandwidth (AlexNet)

Most CNN features will fit within a single M20K memory, and with thousands of M20Ks embedded in the FPGA fabric, the total memory bandwidth available for processing convolution features in parallel is on the order of tens of terabytes per second.

Figure 7 : Arria 10 GX1150 / Stratix 10 GX2800 resources

Depending upon the amount of M20K resource available, it is not always possible to fit a complete network on a single FPGA. In this situation, multiple FPGAs can be connected in series using high-speed serial interconnects, allowing the network pipeline to be extended until sufficient resource is available.
A key advantage of this approach is that it does not rely on batching to maximize performance; the latency is therefore very low, which is important for latency-critical applications.

Figure 8 : Extending a CNN Network Over Multiple FPGAs

Balancing the layers so each takes the same amount of time requires adjusting the number of parallel input layers implemented and the number of pixels processed in parallel.

Figure 9 : Resources for 5×5 convolution layer of AlexNet

Figure 9 lists the resources required for the 5×5 convolution layer of AlexNet with 48 parallel kernels, for both a single-precision and a 16-bit fixed-point version on an Intel Arria 10 FPGA. The numbers include the OpenCL board logic, but illustrate the benefit lower precision has on resource usage.

Fully Connected Layer
Processing of a fully connected layer requires a unique coefficient for each element and therefore quickly becomes memory bound as parallelism increases. The amount of parallelism required to keep pace with the convolution layers would quickly saturate the FPGA's off-chip memory; it is therefore proposed that at this stage the input layers are either batched or pruned.

As the number of elements in an inner product layer is small, the storage required for batching is small compared to that required for the convolution layers. Batching then allows the same coefficient to be used for each batched layer, reducing the external memory bandwidth.

Pruning works by studying the input data and ignoring values below a threshold. As fully connected layers are placed at the later stages of a CNN network, many possible features have already been eliminated. Therefore, pruning can significantly reduce the amount of work required.
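A toy C model of threshold pruning for one fully connected output (names and the threshold value are illustrative) shows how skipped activations translate directly into skipped multiplies:

```c
#include <math.h>
#include <stddef.h>

/* One fully connected output with input pruning: activations whose magnitude
   falls below thresh are ignored, so their coefficient fetches and multiplies
   are skipped entirely. *mults reports how many multiplies actually ran. */
float fc_pruned(const float *act, const float *w, size_t n, float thresh,
                size_t *mults) {
    float sum = 0.0f;
    *mults = 0;
    for (size_t i = 0; i < n; ++i)
        if (fabsf(act[i]) > thresh) {
            sum += act[i] * w[i];
            ++*mults;
        }
    return sum;
}
```

Because late-stage layers have already eliminated many features, the surviving fraction of activations, and hence the multiply count, can be small.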

The key resource driver of the network is the amount of on-chip M20K memory available to store the outputs of each layer. This is constant and independent of the amount of parallelism achieved. Extending the network over multiple FPGAs increases the total amount of M20K memory available and therefore the depth of the CNN that can be processed.

The unique flexibility of the FPGA fabric allows the logic precision to be adjusted to the minimum that a particular network design requires. By limiting the bit precision of the CNN calculation the number of images that can be processed per second can be significantly increased, improving performance and reducing power.

The non-batching approach of the FPGA implementation allows single-frame latency for object recognition, ideal for situations where low latency is crucial, e.g. object avoidance.

Using this approach for AlexNet (single precision for layer 1, then using 16 bit fixed for remaining layers), each image can be processed in ~1.2 milliseconds with a single Arria 10 FPGA, or 0.58 milliseconds with two FPGAs in series.


FPGA Acceleration of Convolutional Neural Networks (published 2018-06-25)

Low Latency Key-Value Store / High Performance Data Center Services

Gateware Defined Networking® (GDN) – Search Low Latency Key-Value Store

Low Latency Key-Value Store (KVS) is an essential service for multiple applications. Telecom directories, Internet Protocol forwarding tables, and de-duplicating storage systems, for example, all need key-value tables to associate data with unique identifiers. In datacenters, high performance KVS tables allow hundreds or thousands of machines to easily share data by simply associating values with keys and allowing client machines to read and write those keys and values over standard high-speed Ethernet.

Algo-Logic’s KVS leverages Gateware Defined Networking® (GDN) on Field Programmable Gate Arrays (FPGAs) to perform lookups with the lowest latency (less than 1 microsecond), the highest throughput, and the least processing energy. Deploying GDN solutions saves network operators time, cost, and power, resulting in significantly lower Total Cost of Ownership (TCO). DOWNLOAD FULL REPORT

Implementing Ultra Low Latency Data Center Services with Programmable Logic

Data centers require many low-level network services to implement high-level applications. Key-Value Store (KVS) is a critical service that associates values with keys and allows machines to share these associations over a network. Most existing KVS systems run in software and scale out by running parallel processes on multiple microprocessor cores to increase throughput.

In this paper, we take an alternate approach by implementing an ultra-low-latency KVS in Field Programmable Gate Array (FPGA) logic. As with a software-based KVS, lookup transactions are sent over Ethernet to the machine that stores the value associated with that key. We find that the implementation in logic, however, scales up to provide much higher search throughput with much lower latency and power consumption than other implementations in software. DOWNLOAD FULL REPORT
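To make the lookup semantics concrete, here is a toy software key-value table in C (hash plus linear probing; the table size, hash constant, and function names are purely illustrative and unrelated to Algo-Logic's gateware, which performs the same associate/read/write operations in hardware at line rate):

```c
#include <stdint.h>

#define KVS_SLOTS 1024  /* toy capacity; a hardware KVS scales far beyond this */

typedef struct { uint64_t key; uint64_t val; int used; } kvs_slot;
static kvs_slot table[KVS_SLOTS];

/* Multiplicative hash onto 10 bits (64 - 54 = 10, matching KVS_SLOTS). */
static uint64_t kvs_hash(uint64_t k) {
    return (k * 0x9E3779B97F4A7C15ull) >> 54;
}

/* Associate val with key (insert or overwrite); -1 if the table is full. */
int kvs_put(uint64_t key, uint64_t val) {
    for (uint64_t i = 0, h = kvs_hash(key); i < KVS_SLOTS; ++i) {
        kvs_slot *s = &table[(h + i) % KVS_SLOTS];
        if (!s->used || s->key == key) {
            s->key = key; s->val = val; s->used = 1;
            return 0;
        }
    }
    return -1;
}

/* Look up the value associated with key; -1 if the key is absent. */
int kvs_get(uint64_t key, uint64_t *val) {
    for (uint64_t i = 0, h = kvs_hash(key); i < KVS_SLOTS; ++i) {
        kvs_slot *s = &table[(h + i) % KVS_SLOTS];
        if (!s->used) return -1;           /* empty slot ends the probe chain */
        if (s->key == key) { *val = s->val; return 0; }
    }
    return -1;
}
```

In the FPGA implementation these probe steps become pipeline stages fed directly from Ethernet, which is what removes the software stack from the latency path.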

Low Latency Key-Value Store / High Performance Data Center Services (published 2017-03-20)

40Gbit AES Encryption Using OpenCL and FPGAs

Altera’s launch of OpenCL support for FPGA systems has ushered in a new era in high performance computing using CPUs and FPGAs in a hybrid computing model. Altera’s OpenCL Compiler (ACL) support for FPGA cards:

• Gives programmers easy access to the power of FPGA computing.
• Offers significantly higher performance at much lower power than is available using other technologies.
• Provides a significant time-to-market advantage compared to traditional FPGA development using a hardware description language.
• Automatically abstracts details of the hardware design away from the designer.

Implementing FPGA designs with the OpenCL compiler allows a designer to easily offload parts of their algorithm to the FPGA to increase performance, lower power and improve productivity.

This parallel programming methodology uses a kernel approach, where data is passed to the specified kernel for processing. Kernel code is written in C with a minimal set of extensions that allows parts of the application code or subroutines to take advantage of parallel execution on the FPGA.

This application note illustrates how to perform AES encryption on FPGAs using the OpenCL tool flow.

Advanced Encryption Standard (AES)
The Advanced Encryption Standard (AES) is a symmetric-key encryption standard that has been adopted by the U.S. government. AES supports key sizes of 128, 192 and 256 bits, all of which operate on data in 128-bit blocks. The AES algorithm consists of multiple bit shifts and Exclusive-Or (XOR) operations, making it an ideal candidate for acceleration on FPGAs.

AES Operations
AES operates on a 4×4 array of bytes, termed the state (different versions of AES with a larger block size have additional columns in the state). AES consists of four distinct processing stages, as listed below:
1. Key Expansion – The round keys are derived from the cipher key using Rijndael’s key schedule.
2. Initial Round :
a. Add Round Key: Each byte of the state is combined with the round key using a bitwise XOR.
3. Rounds
a. Sub Bytes: A non-linear substitution step where each byte is replaced with another using a lookup table.
b. Shift Rows: A transposition step where each row of the state is shifted cyclically a certain number of steps.
c. Mix Columns: A mixing operation which operates on the columns of the state, combining the four bytes in each column.
d. Add Round Key
4. Final Round
a. Sub Bytes
b. Shift Rows
c. Add Round Key

In this implementation the host processor performs the key expansion, and the results are passed to the AES encryption algorithm on the FPGA. The key schedule changes infrequently, only when the session key changes, so this approach has no significant performance impact.

ECB and CTR Ciphers
Electronic Codebook (ECB) is the simplest cipher mode to program on an FPGA. It is easily replicated multiple times and can be pipelined as the output has no effect on the next result. ECB can be seen in Figure 1.

Figure 1. Electronic Codebook (ECB) mode encryption

The downside of ECB is that identical plaintext blocks are encrypted in the same way, which does not hide patterns in the data. A better approach is the Counter (CTR) mode, seen in Figure 2. Here a counter value is encrypted and the result XORed with the plaintext to create the output ciphertext; the counter is then incremented for the next block.

Figure 2. Counter (CTR) mode encryption

The increment applied to the counter is arbitrary, with the most common being a simple count. An added bonus of the CTR method is that the encryption and decryption logic are identical (Figure 3); the counter is simply reset before decrypting.


Figure 3. Counter (CTR) mode decryption

CTR encryption also allows consecutive blocks of data to be encrypted in parallel by including a stride pattern into the counter.
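The mode logic can be sketched in C with a placeholder mixing function standing in for the AES block cipher (deliberately not AES, and not secure; the function names are ours). Encryption and decryption are the same XOR against the keystream, so applying the routine twice restores the plaintext:

```c
#include <stdint.h>
#include <stddef.h>

/* Placeholder block "cipher": an invertibility-preserving mixing function
   standing in for AES so the CTR mode logic is visible. NOT secure. */
static uint64_t toy_block_cipher(uint64_t block, uint64_t key) {
    block ^= key;
    block *= 0x9E3779B97F4A7C15ull;
    return block ^ (block >> 31);
}

/* CTR mode over 64-bit blocks: each block i is XORed with the encryption of
   counter ctr0 + i. Since the counters are independent, parallel lanes can
   each take a strided subset of them, as on the FPGA; here one loop suffices.
   Calling this twice with the same key and ctr0 decrypts. */
void ctr_crypt(uint64_t *data, size_t nblocks, uint64_t key, uint64_t ctr0) {
    for (size_t i = 0; i < nblocks; ++i)
        data[i] ^= toy_block_cipher(ctr0 + i, key);
}
```

The same symmetry holds for real AES-CTR: the FPGA pipeline only ever needs the forward (encrypt) AES core, for both directions.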

OpenCL for FPGAs
The FPGA is effectively a ‘blank canvas’ on which a user can design an architecture fit for purpose.

When an OpenCL kernel is compiled using the ACL compiler, a processing architecture is designed around the needs of the algorithm. This includes integer and floating point logic built to the required depth and accuracy. The memory architecture is designed to meet the needs of the algorithm by utilizing the many hundreds of individually accessible memories available on the FPGA fabric.

Altera’s OpenCL compiler compiles OpenCL kernels, which in turn are built via Altera’s Quartus tools to create an SRAM Object File (SOF). This SOF file can then be downloaded onto the FPGA. The host-side OpenCL API then provides the commands used to control the compiled kernel.

Describing AES encryption using OpenCL
The pseudo code in Figure 4 describes the processing stages required for AES encryption. The nature of the AES encryption algorithm allows the entire code to be unrolled into a single, very deep pipeline containing thousands of integer operations. The ACL compiler provides #pragma directives that can be added to a user’s OpenCL code to instruct nested loops to be unrolled, allowing the full AES code to be flattened. Only a small number of changes to the original OpenCL source code are required.

Targeting a GPU with this modified source code is still possible, as the #pragma directives are simply ignored. This allows the OpenCL code to be functionally verified quickly using a CPU or GPU prior to compilation to the required SOF file.


Figure 4. AES encryption pseudo code



Once the SOF file is built, the OpenCL kernel is downloaded onto the FPGA using the Altera Quartus tools. Just-in-time compilation of the kernel is not permitted due to the FPGA compile times; kernels are therefore loaded using the clCreateProgramWithBinary method. Altera also provides a host-side OpenCL API library for communicating with the FPGA accelerator card.

Target Technology


Nallatech’s PCIe-385N accelerator card is supported by the ACL compiler and host OpenCL API. Figure 5 shows a top-down view of the PCIe-385N. To target the Nallatech card, the OpenCL kernel is compiled using the relevant compiler switch for the 385.

The PCIe-385N features an 8-lane PCI Express Gen 3 capable interface for high speed host communications. The card also has 2 independent banks of DDR3 memory, totaling 16GBytes, coupled to the Stratix V.

AES Performance
The OpenCL FPGA programming methodology allows a programmer to pick the number of “work-groups” that best fits the desired performance, whether that is the maximum possible or a figure tailored to a particular throughput. In this case the goal was to encrypt 40 Gbit Ethernet data, which equates to a throughput of 5 GBytes per second. Note that although the PCIe-385N has a 10Gbit network connection, the data is generated internally for testing purposes.


Figure 6. ACL Output for single kernel

Compilation of the AES algorithm targeting a single work-group yields a predicted throughput of 240 million work items/second. Each work item is a 16 Byte word, giving a throughput of 3.8 GBytes/second — insufficient to encrypt 40 Gbit data. Figure 6 shows the ACL output for a single AES kernel on the FPGA. Fortunately, FPGAs are particularly efficient at integer arithmetic, allowing more than one work-group to fit within the FPGA. Targeting multiple work-groups is done using a simple kernel attribute, “__attribute((num_copies(n)))”. To achieve the desired 40 Gbit data rate, only 2 copies of the AES kernel were required. Figure 7 shows the ACL output for two AES kernels on the FPGA.


Figure 7. ACL Output for two AES kernels

Device Utilisation
How much of the FPGA an algorithm uses is an important aspect of FPGA programming. An algorithm utilizes only the logic it requires, leaving the remaining logic unused and consuming minimal power. Large power savings can therefore be achieved by designing a kernel to meet only the needs of the target system, something that is not possible on CPU and GPU technologies.

There is no point designing an FPGA OpenCL kernel to run 100x faster, using, say, 100% of the device, when Amdahl’s law suggests only a maximum system-level increase of 10x is possible.

To measure the performance improvement, the same OpenCL source code was compiled and run on an AMD Radeon HD 7970 GPU card. This device has 2048 stream processors and an engine clock speed of 925 MHz. The FPGA design has 2 dedicated AES streams and a clock speed of only 170 MHz. The complexity of the AES encryption and the interdependency of the data result in a modest peak performance of ~0.33 GBytes/Sec throughput on this GPU. The FPGA AES streams are able to encrypt a full 16 Byte block every clock cycle, achieving 5.2 GBytes/Sec throughput. All performance figures reflect the kernel processing time only.

The power consumption of the FPGA accelerator is also significantly lower, requiring approximately 25 Watts compared to several hundred Watts on the GPU.

The AES source code was also compiled for a 2GHz Intel Xeon E5503 processor, achieving a performance of ~0.01 GBytes/sec per thread. The low throughput reflects the thousands of operations required for each 16 Byte output of the AES calculation and the limited parallelism available to the CPU.


Figure 8. Performance results for various technologies

To achieve 40 Gbits/second throughput for the AES encryption described here, only 42% of the Stratix V A7 FPGA device was utilized. The remainder could be left unused for power savings, or extra kernels could be placed in parallel with the encryption core.

Altera’s ACL compiler allows easy access to FPGA accelerator technology. For the first time, through OpenCL, code is truly portable between CPU, GPU and FPGA technologies. AES encryption is an example of a class of algorithm that can now be tackled using the OpenCL language, and which could not previously be performed efficiently on traditional compute platforms.

40Gbit AES Encryption Using OpenCL and FPGAs (2017-03-20)

FPGA Acceleration of 3D Component Matching using OpenCL

2D component matching, also called blob extraction or region extraction, is commonly used in computer vision for detecting connected regions that meet pre-determined criteria, such as a threshold value. The technique can also be extended to volumes. Use cases include medical imaging volume analysis (e.g. MRI results), core porosity analysis (e.g. oil & gas) and many other connectivity analysis problems.

A technique for 2D component labeling is presented here, with a follow-on section describing how this can be extended to 3D volumes. This paper shows how it is possible to dramatically accelerate 3D component matching on an energy-efficient FPGA-based platform using OpenCL – the open standard for parallel programming. For 2D component matching, several algorithms are commonly cited; the following are two examples.

One component at a time
The 2D image is scanned until a pixel meets the required criteria. The pixel’s neighbors are then analyzed and a linked list is created of the connected neighbors. This process is repeated recursively until no more connected neighbors are found. All pixels that were part of a connected linked list are assigned the same index. The index is then incremented and the next unconnected point on the image is analyzed. The process continues until the entire image is scanned. This technique can easily be adapted to 3 dimensions, also known as 3D component matching.
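The steps above can be sketched in plain C. This is an illustrative version only: it uses an explicit stack instead of recursion, 4-connectivity, and a fixed 8×8 image; all names and sizes are mine.

```c
#include <stddef.h>

/* One-component-at-a-time labelling on a small binary image.
 * Pixels meeting the criterion (img[i] != 0) are grown into
 * components with an explicit stack (iterative flood fill). */
#define W 8
#define H 8

int label_components(const unsigned char *img, int *labels)
{
    int stack[W * H];
    int next_id = 0;
    for (int i = 0; i < W * H; i++) labels[i] = 0;
    for (int seed = 0; seed < W * H; seed++) {
        if (!img[seed] || labels[seed]) continue;
        next_id++;                       /* new component found */
        int top = 0;
        stack[top++] = seed;
        labels[seed] = next_id;
        while (top > 0) {                /* grow until no neighbours left */
            int p = stack[--top];
            int x = p % W, y = p / W;
            const int nx[4] = { x - 1, x + 1, x, x };
            const int ny[4] = { y, y, y - 1, y + 1 };
            for (int k = 0; k < 4; k++) {
                if (nx[k] < 0 || nx[k] >= W || ny[k] < 0 || ny[k] >= H)
                    continue;
                int q = ny[k] * W + nx[k];
                if (img[q] && !labels[q]) {
                    labels[q] = next_id;  /* same index for the region */
                    stack[top++] = q;
                }
            }
        }
    }
    return next_id;                      /* number of components */
}
```

The pop order of the stack gives exactly the random memory traversal the text describes, which is why this approach is bound by system memory bandwidth on large images.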

The random traversal through memory required by this approach places the performance bottleneck on system memory bandwidth.

Two pass
For the two-pass algorithm, the image is scanned linearly from the top-left corner to the bottom-right corner. A pixel is given an ID according to the minimum value of its neighbors. If no labelled neighbor exists, the ID value is incremented and the pixel is set to this new value.


Figure 1 : Component ID labelling


Figure 1 illustrates the surrounding pixels required to obtain the new ID for the current pixel. If pixel A or D is non-zero, differs from C, and the current pixel is valid, we have a condition where two IDs clash. At this point the lowest ID is assigned and a note of the swap is made in a lookup table, mapping the higher ID to the lower. This lookup table is used after the image is scanned to replace IDs that have been merged into other IDs.



Figure 2: Components merging

After the image has been scanned, the lookup table is used to replace merged IDs, creating the final connected component image, as illustrated in Figure 2. The second stage is not necessary if the data is stored as both the pre-merged image and the component ID lookup table, avoiding a costly rescan of the image: when the results are used later on, the pre-merged image is simply passed through the lookup table to produce the final image. Whether this is worthwhile depends upon the number of IDs discovered and the amount of resource required to store them, which in turn depends upon the size and complexity of the image.
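A compact C sketch of the two-pass scheme follows. For brevity it checks only the up and left neighbours (4-connectivity), whereas the mask in Figure 1 also includes the diagonals A and D; identifiers and the 6×4 test size are mine.

```c
#define TW 6
#define TH 4
#define MAX_IDS 64

/* Resolve an ID through the merge lookup table (follow to root). */
static int resolve(const int *lut, int id)
{
    while (lut[id] != id) id = lut[id];
    return id;
}

/* Two-pass labelling: linear scan with provisional IDs, clashes
 * recorded in a lookup table, then a relabelling pass. */
int two_pass(const unsigned char *img, int *labels)
{
    int lut[MAX_IDS];
    int next_id = 0;
    /* pass 1: linear scan */
    for (int y = 0; y < TH; y++) {
        for (int x = 0; x < TW; x++) {
            int p = y * TW + x;
            if (!img[p]) { labels[p] = 0; continue; }
            int left = (x > 0) ? labels[p - 1] : 0;
            int up   = (y > 0) ? labels[p - TW] : 0;
            if (!left && !up) {              /* no neighbour: new ID */
                next_id++;
                lut[next_id] = next_id;
                labels[p] = next_id;
            } else if (left && up && left != up) {
                int a = resolve(lut, left), b = resolve(lut, up);
                int lo = a < b ? a : b, hi = a < b ? b : a;
                lut[hi] = lo;                /* record the clash */
                labels[p] = lo;
            } else {
                labels[p] = left ? left : up;
            }
        }
    }
    /* pass 2: replace merged IDs via the lookup table */
    int count = 0;
    for (int i = 1; i <= next_id; i++)
        if (resolve(lut, i) == i) count++;
    for (int p = 0; p < TW * TH; p++)
        if (labels[p]) labels[p] = resolve(lut, labels[p]);
    return count;                            /* distinct components */
}
```

Both passes access the image strictly in scan order, which is what makes this variant suitable for a pipelined FPGA implementation.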

The two pass algorithm is most suitable for implementation on an FPGA as the data access is linear and therefore suitable for a pipelined design.

3D component labelling
3D component labelling is typically performed using the “one component at a time” approach. However, for large images the volume quickly exceeds the CPU cache and the CPU starts cache thrashing, significantly reducing overall performance. An FPGA would likewise be unable to hold the volume data in local memory, limiting performance to the global memory bandwidth of the accelerator. To avoid these issues, the two-pass approach is applied to create a series of 2D component-matched planes. The 2D planes are then combined using a similar approach to the 2D merging illustrated in Figure 2; however, the previous plane in the z axis is also considered.

The number of individual IDs required for a large volume would exceed the storage capacity of local FPGA memory. Therefore, a technique is applied where only the current and previous plane IDs are stored. An ID that fails to appear in, and has not been linked with, the current plane can be considered finished and will not occur again; at this point the ID is written out to global memory. This limits the number of IDs held in local memory to 2x the maximum number of IDs expected for any one plane.
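The retirement test can be sketched as a simple set difference between the two resident planes. This is a schematic of the bookkeeping only (the function name and flat ID lists are mine, not the paper's implementation):

```c
/* Schematic of the two-plane bookkeeping: IDs active in the previous
 * plane that do not appear in (or link to) the current plane are
 * finished and can be flushed to global memory. */
int retire_ids(const int *prev, int n_prev,
               const int *curr, int n_curr,
               int *retired)
{
    int n = 0;
    for (int i = 0; i < n_prev; i++) {
        int alive = 0;
        for (int j = 0; j < n_curr; j++)
            if (prev[i] == curr[j]) { alive = 1; break; }
        if (!alive) retired[n++] = prev[i];  /* safe to flush */
    }
    return n;                                /* number retired */
}
```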

Linking between planes is illustrated in Figure 3.

Figure 3 : Using overlapping planes to connect components in 3D component matching

It is often desirable to store statistics regarding the connected component data, such as the number of occurrences of an ID. In a similar fashion to the IDs, the current statistics of an ID can be held in local memory; only when an ID ceases to exist are its statistics committed to global memory.

An added bonus of this approach is the ability to reuse planes for different volume analyses. If the purpose of the 3D connected component labelling is to produce spatial statistics on a volume, any overlapping volumes can reuse the 2D plane data without the need to recalculate. This can save a significant amount of processing time, depending upon the amount of overlap.

FPGA implementation
The algorithm is relatively simple and contains no complex logic, so it requires little compute resource relative to what is available in modern FPGA devices. Thanks to the techniques applied here, the algorithm also requires only a small percentage of the global memory bandwidth typically available on FPGA accelerator boards. It is therefore possible to implement many parallel instantiations, working on different thresholds or volumes, until either resource or global memory bandwidth is exhausted.

The 3D volume cannot be subdivided for parallelisation, as the previous plane calculation is required prior to calculating the next. However, it is often desirable to process many different threshold values of a 3D volume, for example to analyse boundaries between different materials. Therefore, for this white paper it is assumed that multiple threshold values will be processed in parallel.

OpenCL Implementation
The AOC compiler provided by Altera allows users to target FPGA accelerators using the Khronos OpenCL standard. This section describes how to implement the connected 3D component matching technique using OpenCL, targeting FPGA devices.

Figure 4 : Nallatech 385 FPGA card


The FPGA device targeted here was a Stratix V A7. This is a mid-range FPGA of the Stratix V series and provides a good balance of on-chip memory and logic.

In order to achieve good acceleration it was necessary to replicate the algorithm as many times as possible. For the implementation described here, two processing stages were implemented:

1. Creating the 2D connected component planes for the 3D volume, at various threshold values.
2. Merging the planes together to create a connected 3D volume, recording volume statistics as desired.

This could be implemented as two different processing kernels, or as two distinct programs with the FPGA reprogrammed between stages 1 and 2. The latter is more desirable if re-use of plane data is expected and is the approach described here.

Kernel 1 : Creating the Planes
The OpenCL compiler allows two distinct programming approaches. The first is the traditional SIMD approach using an NDRange kernel; the second is a single work-item flow, the approach recommended by Altera when a design has loop or memory dependencies. A single work-item flow pipelines loops within the kernel, executing a new index every clock cycle where possible. This allows a technique referred to as a “sliding window” to be utilised, massively reducing the impact on global memory bandwidth: previously calculated rows are stored in local memory, removing the need to constantly refer to off-chip global memory.

Figure 5 : Sliding window


With the sliding window implemented there is only one read and one write to global memory per pixel.
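The row-buffer idea can be demonstrated in plain C. The sketch below computes a 3x3 box sum while reading each input pixel from "global" memory exactly once, holding the two previous rows in small arrays that stand in for on-chip local memory; the filter, names and 8×6 size are all illustrative, not the paper's kernel.

```c
#define SW 8
#define SH 6

/* 3x3 box sum with a sliding window: one global read per pixel,
 * with the two previous rows held in local row buffers, as an
 * FPGA pipeline would hold them on chip. Assumes SH >= 2. */
void box3x3(const int *in, int *out)
{
    int row0[SW] = {0}, row1[SW] = {0};    /* rows y-2 and y-1 */
    for (int y = 0; y < SH; y++) {
        int curr[SW];
        for (int x = 0; x < SW; x++)
            curr[x] = in[y * SW + x];      /* one global read per pixel */
        for (int x = 0; x < SW; x++) {
            int sum = 0;
            for (int dx = -1; dx <= 1; dx++) {
                int xx = x + dx;
                if (xx < 0 || xx >= SW) continue;
                sum += curr[xx];           /* row below the centre */
                if (y >= 1) sum += row1[xx];
                if (y >= 2) sum += row0[xx];
            }
            /* window centred on (x, y-1), valid once a row of history
             * exists; one global write per pixel */
            if (y >= 1) out[(y - 1) * SW + x] = sum;
        }
        for (int x = 0; x < SW; x++) { row0[x] = row1[x]; row1[x] = curr[x]; }
    }
    /* last row: window centred on y = SH-1, no row below it */
    for (int x = 0; x < SW; x++) {
        int sum = 0;
        for (int dx = -1; dx <= 1; dx++) {
            int xx = x + dx;
            if (xx < 0 || xx >= SW) continue;
            sum += row1[xx] + row0[xx];
        }
        out[(SH - 1) * SW + x] = sum;
    }
}
```

Every pixel in the 3x3 neighbourhood is touched three times by the filter, yet global memory sees only one read and one write per pixel, which is the saving the sliding window provides.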

After the plane has been processed the new plane ID lookup table is stored in global memory ready for the second phase (Linking of planes).

It is possible to create multiple kernels on the FPGA accelerator, one for each threshold to be processed. However, as the FPGA is a blank canvas, every access to global memory must create its own memory controller circuitry; with tens of kernels implemented, the memory control logic would occupy a large amount of the on-device resource. To avoid replicating this circuitry, we can create 2 kernels dedicated to handling global memory accesses, one for reading input data and another for writing output data, i.e. a producer and a consumer kernel. These kernels fan data out to, and gather results from, the processing kernels. This prevents the unnecessary replication of global memory logic and allows more parallel paths to be implemented.

Figure 6 : Multiple kernels connected via channels

Figure 6 shows the arrangement of consumer, producer and worker kernels used to implement multiple paths. The communication between kernels is done via channels. Each worker kernel receives data from its own channel and writes results back to its own output channel. Each worker kernel is therefore identical with the exception of the channel IDs.
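As a rough software model of this layout, the sketch below runs a producer, four identical workers and a consumer sequentially, with channels modelled as small FIFOs. The worker operation (2x + 1), channel depth and all names are stand-ins; a real AOC design would use OpenCL channels and concurrent kernels.

```c
#include <string.h>

#define NWORK 4
#define CHAN_DEPTH 8

/* Channels modelled as small FIFOs; only the producer and consumer
 * ever touch "global" memory (the in/out arrays). Assumes
 * n <= NWORK * CHAN_DEPTH so no FIFO overflows. */
typedef struct { int buf[CHAN_DEPTH]; int head, tail; } channel;

static void ch_write(channel *c, int v) { c->buf[c->tail++ % CHAN_DEPTH] = v; }
static int  ch_read(channel *c)         { return c->buf[c->head++ % CHAN_DEPTH]; }

void run_pipeline(const int *in, int *out, int n)
{
    channel to_worker[NWORK], from_worker[NWORK];
    memset(to_worker, 0, sizeof to_worker);
    memset(from_worker, 0, sizeof from_worker);
    for (int i = 0; i < n; i++)                 /* producer: global reads */
        ch_write(&to_worker[i % NWORK], in[i]);
    for (int w = 0; w < NWORK; w++)             /* identical workers,     */
        for (int i = w; i < n; i += NWORK)      /* differing only in IDs  */
            ch_write(&from_worker[w], ch_read(&to_worker[w]) * 2 + 1);
    for (int i = 0; i < n; i++)                 /* consumer: global writes */
        out[i] = ch_read(&from_worker[i % NWORK]);
}
```

The round-robin fan-out preserves ordering, so the consumer can gather results deterministically without tagging.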

Kernel 2 : Linking the planes
Once each plane has been created it is necessary to link the planes in order to create the 3D connected component volume. Again a sliding window is used to reduce the number of global memory accesses; in this case two inputs and one output are required, as both the current and previous planes’ input data is needed.

Figure 7 : Back plane sliding window

Figure 8 : Linking IDs between planes


Figure 8 illustrates the 9 pixels from the previous plane that are possible connections with the front pixel. As we scan along the current row, any paths along the three back rows are tracked (see Figure 9). If no paths via the back or current plane are possible, a path is said to no longer exist and the current path ends. Once a row is complete, the IDs are modified to equal the minimum ID found on the valid paths.



Figure 9: Link front plane to back plane


As IDs can change from one row to the next, an ID conversion must be applied to each new back-plane pixel read from the sliding window. This would ordinarily require 9 reads from the ID lookup table; however, the locality of the data means that adjacent pixels of the back plane must be equivalent, reducing the 9 possible IDs to just 4. This is convenient, as the Altera FPGA devices permit 4 simultaneous accesses to local memory.

After the plane is completed the statistics of any IDs that no longer exist are stored in global memory to be retrieved by the host.

Multiple Binaries
An individual aocx (device binary) file was generated for the plane creation and for linking the planes. The host programs the device with the first binary and executes it; the results are placed in global memory. The device is then reprogrammed with the second binary, which executes, reading the first binary’s results from global memory. The final results are then retrieved by the host.

The following benchmark targeted Nallatech’s p385n_hpc_a7 accelerator board, which allowed up to 8 parallel worker kernels to be instantiated in a single FPGA device. This was compared to a single core of a 2GHz Xeon E5-2430 with a 15360 KByte cache, running a “one component at a time” implementation optimized for a CPU.

The performance improvement of the FPGA varies with the complexity of the volume being analyzed: the more complex the image, the better the FPGA performs compared to the CPU. When the image is sparse, the FPGA has only a few IDs to report, but the CPU does not have to traverse far through the volume, so the acceleration is modest. When the data is dense, the CPU must traverse all points in a nonlinear fashion, whereas the FPGA traverses the data linearly, giving a significant performance improvement.

To quantify the acceleration it is necessary to plot performance increase against the density of the image. Figure 10 shows the time taken to process 8 threshold values for varying density of the volume.


Figure 10 : Processing time versus density of valid data (%)
(256x256x256 data points, 8 parallel thresholds)

Figure 11 : Acceleration versus density of valid data (%)

As can be seen from Figure 10, the performance of the Xeon tails off quickly for volumes with a high percentage of valid data points. This is due to the linked list used to track the current position growing in complexity. The FPGA version does not require a linked list and is therefore unaffected by how densely packed the volume is. However, the FPGA performance is affected by the number of unique path IDs: any IDs that must be merged have their IDs swapped at the end of each row, and the likelihood of this occurring increases with the number of IDs, increasing the time spent in the ID-swapping logic. For a very dense volume the number of unique IDs falls, until there is just 1 ID for a nearly full volume. At this point no ID swapping is needed and the FPGA implementation is at its most efficient.


Figure 12 : Processing time (seconds) versus percentage of
volume occupied. (256x256x256 data points, 8 parallel thresholds)

Using OpenCL and FPGAs it is possible to significantly accelerate connected 3D component matching. With the memory-efficient algorithm described here, it is possible to replicate the processing kernel multiple times. The technique should extend to future, larger FPGAs with more resource; next-generation FPGAs should therefore yield significantly greater performance than demonstrated here.

The plane implementation will also scale to larger volumes. The only limitation is the number of IDs required for each plane: the number of potential unique IDs increases linearly with the area of the plane, and these have to be stored in local memory on the FPGA. There is, however, no limit to the number of planes or the depth of the volume, global memory depth permitting.

Future Roadmap
The next generation of Altera FPGAs should provide an order of magnitude improvement over the results presented here. With the introduction of Stratix 10, Altera devices will utilize 14nm Tri-Gate transistor technology; the resulting higher clock speeds and denser devices represent a step change in overall performance. Applying these projected gains to the 385 results presented earlier suggests a greater than 10x performance improvement versus Stratix V.


Figure 13 : Acceleration Versus a single Xeon Core

FPGA Acceleration of 3D Component Matching using OpenCL (2017-03-20)

FPGA Acceleration of Lattice Boltzmann using OpenCL

The Lattice Boltzmann Method (LBM) is a technique for simulating the movement of complex fluid systems. Fluid systems are used in many industries to transmit signals and power using a network of tanks, pipes, valves, pumps and other flow devices. Examples of applications include industrial processing, vehicular control and medical appliances. It is important that companies using fluid systems in this way have a systematic method of mathematically modelling different types of fluid systems for safe and reliable operation. This can be achieved, but typically at significant computational cost. Computational Fluid Dynamics (CFD) is one of the most demanding branches of high-performance computing (HPC) in terms of resources, and there is constant demand for cheaper, faster CFD computing platforms. This paper describes how it is possible to dramatically accelerate the LBM technique on an energy-efficient FPGA-based platform using OpenCL – the open standard for parallel programming.

Traditional CFD methods solve the conservation equations for mass, energy, etc., whereas the LBM model uses particles to propagate these quantities. To simulate every particle in a system would be impossible, hence the LBM technique uses particle densities confined to a discrete lattice to simulate particle interactions.

The LBM technique is split into two stages: Collision and Streaming. The collision stage looks to balance the particle distributions. There are various techniques for finding an equilibrium, some more accurate than others. The operator used here is the Bhatnagar-Gross-Krook (BGK) operator.


Figure 1 : D2Q9 Lattice


Different lattice topologies are possible for different dimensions and algorithm approaches. A popular way to classify lattices is the DnQm scheme, where n stands for the number of dimensions and m for the number of lattice velocities.

The lattice used in this white paper is a D2Q9 lattice illustrated in Figure 1.


LBM Maths
The following equation applies the BGK operator to the 9 lattice velocities contributing to the current lattice point:


Equation 1 : D2Q9 equilibrium distribution function
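For reference, the standard D2Q9 forms (in lattice units, assuming the common convention with lattice speed c = 1) are the equilibrium distribution

```latex
f_i^{\mathrm{eq}} = w_i\,\rho\left(1 + 3\,(\mathbf{e}_i\cdot\mathbf{u})
   + \tfrac{9}{2}\,(\mathbf{e}_i\cdot\mathbf{u})^2
   - \tfrac{3}{2}\,\mathbf{u}\cdot\mathbf{u}\right),
\qquad
w_i = \begin{cases} 4/9 & i = 0 \\ 1/9 & i = 1,\dots,4 \\ 1/36 & i = 5,\dots,8 \end{cases}
```

and the BGK collide-and-stream update with relaxation time tau:

```latex
f_i(\mathbf{x}+\mathbf{e}_i\,\Delta t,\; t+\Delta t)
   = f_i(\mathbf{x},t) - \frac{1}{\tau}\Bigl(f_i(\mathbf{x},t) - f_i^{\mathrm{eq}}(\mathbf{x},t)\Bigr)
```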

Once the new distributions are calculated, they must be distributed to neighboring lattice points. This is the Streaming stage of LBM.

Figure 2: Streaming stage

Particle distributions are swapped between lattice points along the 8 non-zero direction vectors.

The lattice Boltzmann code is a memory-bound problem: for the D2Q9 lattice, 9 floating point numbers must be read and updated for every lattice point during the collision phase. Here data is read in a linear fashion; the streaming stage, however, must perform some out-of-order memory accesses to swap data between adjacent lattice points.

For a GPU implementation, it is the global memory access that ultimately limits the performance of the lattice Boltzmann code. FPGAs, however, offer an alternative approach that removes this memory bottleneck and provides almost unlimited scalability.

Lattice Boltzmann FPGA OpenCL
Typical OpenCL Lattice Boltzmann implementations work by creating hundreds of threads, all working in parallel but ultimately limited by the available global memory bandwidth. The Altera OpenCL compiler offers an alternative OpenCL programming model that creates one or more pipelined kernels, where parallelism comes from the depth of the pipeline: the more complex the pipeline, the more floating-point logic operates in parallel.

FPGAs have significant local memory resources that can be configured in many different ways, from large single buffers to hundreds of small buffers. This flexibility allows the Altera OpenCL Compiler (AOC) to create memory topologies specifically designed for the algorithm being accelerated, significantly reducing its global memory bandwidth requirements.

The collision stage of the LBM algorithm accesses global memory in a linear fashion and needs no optimization. The streaming stage, however, requires data from the neighboring lattice points. The delivery of this data can be optimized by using a cached copy of the output in what is referred to as a Sliding Window: data is read linearly and buffered in local memory, from which it can be re-read as often as required.
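The collision stage itself is purely local to each lattice point. A plain-C sketch of a single-site D2Q9 BGK collision follows (lattice units, illustrative names); a useful property to note is that BGK conserves density and momentum exactly, whatever the relaxation time.

```c
/* Single-site BGK collision for the D2Q9 lattice (lattice units,
 * dx = dt = c = 1). tau is the relaxation time. */
static const double W9[9] = { 4.0/9,
    1.0/9, 1.0/9, 1.0/9, 1.0/9,
    1.0/36, 1.0/36, 1.0/36, 1.0/36 };
static const int EX[9] = { 0, 1, 0, -1, 0, 1, -1, -1, 1 };
static const int EY[9] = { 0, 0, 1, 0, -1, 1, 1, -1, -1 };

void bgk_collide(double f[9], double tau)
{
    /* macroscopic density and velocity (moments of f) */
    double rho = 0.0, ux = 0.0, uy = 0.0;
    for (int i = 0; i < 9; i++) {
        rho += f[i];
        ux  += f[i] * EX[i];
        uy  += f[i] * EY[i];
    }
    ux /= rho; uy /= rho;
    /* relax each distribution toward its equilibrium */
    double usq = ux * ux + uy * uy;
    for (int i = 0; i < 9; i++) {
        double eu = EX[i] * ux + EY[i] * uy;
        double feq = W9[i] * rho * (1.0 + 3.0 * eu
                     + 4.5 * eu * eu - 1.5 * usq);
        f[i] -= (f[i] - feq) / tau;
    }
}
```

On the FPGA, this per-site arithmetic becomes one deep pipeline stage; the streaming of f to neighbouring sites is what the Sliding Window handles.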

Figure 3 : Sliding window


The Sliding Window allows previously-calculated rows to be stored in local memory, allowing the streaming stage to be combined with the collision stage. The entire algorithm can then be pipelined to generate a result every clock cycle. What’s more, multiple pipelines can be cascaded together, with global memory access only required for input into the first stage and output from the final stage. The number of lattice points calculated per second therefore increases linearly with each pipeline stage, with no increase in global memory requirements.


Table 1


Table 1 lists the performance of the BGK D2Q9 algorithm for various technologies.

There are 106 floating point calculations required per lattice update (LUT). This makes the sustained floating point performance equivalent to 106 multiplied by the lattice updates per second (LUTs/sec).

Figure 4: Multiple time steps implemented in a pipeline


The pipelining allows the performance to improve linearly with each new pipeline stage, until the resources of the FPGA are exhausted. Four such pipelines fit into a PCIe-385N D5 part.

Figure 5 : PCIe-385N D5 and Server

Figure 6 : Performance, MLUTs/Sec

When measuring HPC performance, it is important to consider the power footprint of different technologies. Figure 7 shows the performance per watt for the three technologies studied.

Figure 7 : Performance, MLUTs/Sec/Watt

The following images show the output of the FPGA implementation. Yellow depicts the areas of fastest flow, whilst the black areas are slowest.


Figure 8 : Flow through a slit

Figure 9 : Turbulent flow around a sphere

Figure 10 : Flow through a porous object

D3Q19 3D Lattice
The implementation described here can also be applied to a 3D lattice. In this case the Sliding Window stores planes rather than lines of lattice data, which requires more internal memory and limits the size of plane that can be processed; the cross-section of the volume is therefore constrained by on-chip memory. The depth of the lattice is, however, unlimited.

Using the OpenCL tool flow, it was possible to achieve significant acceleration of a well-known HPC problem using FPGA technology in only a few days of coding. By abstracting the complexities of FPGA interfaces and hardware description languages, OpenCL massively increases productivity without significantly sacrificing design performance. This allows developers to quickly verify the suitability of FPGA acceleration without committing to months/years of design effort. To learn more about the advantages of FPGA-based acceleration, please visit nallatech.com


FPGA Acceleration of Lattice Boltzmann using OpenCL (2017-03-20)