About FPGASolutions


FPGA-Accelerated NVMe Storage Solutions



In recent years, the migration towards NAND flash-based storage and the introduction of Non-Volatile Memory Express® (NVMe™) have multiplied the opportunities for technology companies to “do storage” differently [1]. The rapid growth and diversity of real-time digital businesses have demanded this innovation so that new products and services can be realized. New storage products have therefore followed trends towards higher bandwidth, lower latency and reduced footprint and total cost of ownership, all critical improvements for companies relying on large infrastructures. Recent market reports [2] forecast that the NVMe market will grow at approximately 15% CAGR to reach $57 billion by 2020. The NVMe market continues to evolve and seeks further technological innovations in three areas:

(1) storage virtualization to increase flexibility and security
(2) localized data processing close to the stored data
(3) disaggregated storage for optimized infrastructures [3]

In March 2018, Nallatech announced the 250 series of FPGA products, which provide innovative solutions to cater to the needs of the storage market. The 250 series features Xilinx® UltraScale+™ FPGAs and MPSoCs, which offer ASIC-class functionality in a single chip and fit the technology needs of the storage industry [6]. By combining NVMe with reconfigurable FPGA and MPSoC logic, Nallatech is offering a new class of storage products with a critical differentiator in a fast-evolving market: the flexibility and reconfigurability of the Xilinx devices guarantee that 250-based solutions can remain current as the NVMe standard incorporates new features over time [5].

This application note describes how Nallatech’s 250 series of FPGA and MPSoC-enabled accelerator products can be used to allow customers to construct high-performance, scalable NVMe infrastructures for next generation IoT and cloud infrastructures.

The Technology & Characteristics

NVM Express is a high-throughput, low-latency storage technology which has increased the performance and flexibility of existing datacenter infrastructures [8]. By replacing storage technologies like SAS and SATA with NVMe, IT architects have multiplied the available storage throughput of their datacenters [6]. The NVMe protocol sits on top of the PCIe transport layer and operates over PCIe 3.0 today; with new drives expected to become available in 2018, NVMe will soon support PCIe 4.0 rates [7]. Over PCIe 3.0, NVMe can transfer up to a theoretical maximum of about 4GB/s in the 4-lane U.2 or M.2 form factors; for example, Micron 9200 U.2 SSDs deliver sustained sequential read/write performance of up to 3.5GB/s according to the manufacturer’s website.
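The 4GB/s figure can be checked with a short calculation. The sketch below assumes the published PCIe signalling rates (8 GT/s per lane for PCIe 3.0, 16 GT/s for PCIe 4.0) and 128b/130b line encoding, and ignores packet and protocol overhead, so real drives land somewhat lower. The function and table names are illustrative only.

```python
# Theoretical one-direction PCIe bandwidth, ignoring packet (TLP) overhead.
# Assumption: published PCIe 3.0/4.0 per-lane rates and 128b/130b encoding.
PCIE_GEN = {
    3: {"gt_per_s": 8.0, "encoding": 128 / 130},
    4: {"gt_per_s": 16.0, "encoding": 128 / 130},
}


def pcie_bandwidth_gb_s(gen: int, lanes: int) -> float:
    """Return the theoretical bandwidth in decimal GB/s for one direction."""
    link = PCIE_GEN[gen]
    bits_per_second = link["gt_per_s"] * 1e9 * link["encoding"] * lanes
    return bits_per_second / 8 / 1e9


if __name__ == "__main__":
    for gen, lanes in [(3, 4), (3, 8), (3, 16), (4, 4)]:
        print(f"PCIe {gen}.0 x{lanes}: {pcie_bandwidth_gb_s(gen, lanes):.2f} GB/s")
```

A 4-lane PCIe 3.0 link works out to roughly 3.94 GB/s, which is consistent with the 3.5GB/s sustained figure quoted for the Micron drive once overheads are taken into account.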

Beyond the additional read/write throughput, NVMe supports new features that improve a storage array over SAS or SATA equivalents. The NVMe protocol better fits the needs of enterprise environments by supporting 64K I/O queues. Previous technologies based on spinning disks showed limitations due to fewer queues. With NVMe, each of the host’s processes can have its own queue and manage it independently. Additionally, the NVMe specification provides a well-defined arbitration mechanism to assign priorities to each of the queues, as well as support for MSI/MSI-X and interrupt aggregation. This finer-grained tuning of read/write operations to the NVMe storage increases the overall system performance.
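To illustrate the idea of per-process queues with prioritized arbitration, here is a simplified sketch loosely modelled on the weighted round robin scheme with an urgent class described in the NVMe specification. It is not the controller's actual algorithm: the class names, the `arbitrate` function and the way weights are spent per round are assumptions made for the example.

```python
from collections import deque
from typing import Deque, Dict, List


class SubmissionQueue:
    """A per-process queue of pending commands with a priority class."""

    def __init__(self, name: str, priority: str) -> None:
        self.name = name
        self.priority = priority  # "urgent", "high", "medium" or "low"
        self.commands: Deque[str] = deque()

    def submit(self, command: str) -> None:
        self.commands.append(command)


def arbitrate(queues: List[SubmissionQueue], weights: Dict[str, int], rounds: int) -> List[str]:
    """Return the order in which commands are fetched over `rounds` passes."""
    order: List[str] = []
    for _ in range(rounds):
        # Urgent queues are drained first (strict priority).
        for q in queues:
            if q.priority == "urgent":
                while q.commands:
                    order.append(f"{q.name}:{q.commands.popleft()}")
        # Each weighted class may fetch up to `weight` commands per pass,
        # shared round-robin between the queues of that class.
        for cls in ("high", "medium", "low"):
            credits = weights[cls]
            members = [q for q in queues if q.priority == cls]
            while credits > 0 and any(q.commands for q in members):
                for q in members:
                    if credits == 0:
                        break
                    if q.commands:
                        order.append(f"{q.name}:{q.commands.popleft()}")
                        credits -= 1
    return order


if __name__ == "__main__":
    db, backup, log = (SubmissionQueue("db", "high"),
                       SubmissionQueue("backup", "low"),
                       SubmissionQueue("log", "urgent"))
    for i in range(3):
        db.submit(f"r{i}")
        backup.submit(f"w{i}")
    log.submit("flush")
    print(arbitrate([db, backup, log], {"high": 2, "medium": 1, "low": 1}, rounds=3))
```

With these example weights, the latency-sensitive database queue is served twice as often as the background backup queue, while the urgent log flush always goes first.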

The NVMe specification also supports more efficient random I/O operations and outperforms SAS or SATA SSD technologies by a factor of 10 in 4KB random read tests. The performance improvement for 4KB transfers better aligns with the requirements of enterprise software applications and operating systems. The support for multiple namespaces and the efficient support of I/O virtualization architectures like SR-IOV make NVMe the technology of choice in the datacenter. The high performance of the NVMe technology facilitates the sharing of hardware resources between several Virtual Machines or users. For example, 25 IOPS is a relevant estimate for the requirements of a power user in a virtualized infrastructure. With 700,000 random 4K read IOPS, a single NVMe drive could fulfill the needs of 500 power users, where 25 SAS HDDs would typically be needed [8].
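The sizing argument above can be written out as a small calculation. The per-user and per-drive figures come from the text; the per-HDD figure is not stated there, so the sketch only derives the value implied by the 25-drive estimate. Function names are illustrative.

```python
import math


def aggregate_iops(users: int, iops_per_user: int) -> int:
    """Total random-read IOPS required by a population of users."""
    return users * iops_per_user


def devices_needed(required_iops: int, iops_per_device: int) -> int:
    """Number of devices needed to meet the demand on IOPS alone."""
    return math.ceil(required_iops / iops_per_device)


if __name__ == "__main__":
    demand = aggregate_iops(users=500, iops_per_user=25)   # 12,500 IOPS
    nvme_drives = devices_needed(demand, 700_000)          # one NVMe drive
    implied_hdd_iops = demand / 25                         # implied by 25 HDDs
    print(f"demand: {demand} IOPS")
    print(f"NVMe drives needed: {nvme_drives}")
    print(f"implied IOPS per SAS HDD: {implied_hdd_iops:.0f}")
```

On IOPS alone a single drive has considerable headroom beyond 500 users; in practice capacity and latency targets would also constrain the number of users per drive.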

Bringing FPGAs and MPSoCs into the mix adds flexibility to an already performant NVMe datacenter infrastructure. The NVMe specification defines a superset of commands that not all NVMe drives support; by using IP-based NVMe controllers, adjusting to new NVMe equipment only requires a firmware update of the reconfigurable logic instead of a hardware upgrade. In fact, over recent months, innovative approaches to storage, like Open Channel, have caused a mini-revolution in the NVMe market. Large datacenter players are now pushing for a new NVMe hardware model. Microsoft recently introduced Project Denali, which specifies a methodology to disaggregate flash storage in which the NVMe drives do not handle management functions; instead, new reconfigurable hardware does. With Project Denali and Open Channel, the functions traditionally performed in monolithic SSDs are offloaded to the host or to an FPGA/MPSoC accelerator. Nallatech’s 250 product series fits this new model and provides an agile solution which will adapt to the new models as they get finalized.

NVMe Roadmap

Since the creation of NVMe in 2011, the NVMe consortium has remained very active. In fact, the NVMe protocol is currently evolving from three perspectives defined in separate specifications. In addition to the base NVMe specification, the NVMe Management Interface (NVMe-MI) details how to manage communications and devices (device discovery, monitoring, etc.), and NVMe over Fabric (NVMe-oF) defines how to communicate with non-volatile storage over a network, making the protocol transport agnostic [9].

Over time, as more users from various industries adopt NVMe, they identify needs for new features and introduce new ideas for the specification. The adoption of the NVMe protocol is still growing and it is generating innovation. Hardware and software companies are finding new ways to get to the memory through the introduction of new form factors, the creation of new products and appliances, etc. The focus of the NVMe ecosystem is to give users the means to scale into the datacenter or hyperscale infrastructures, and the protocol specification will continue to evolve in that direction [9].

2019 will see the release of revision 1.4 of the NVMe base specification, which will lead to improvements in data latency, high-performance access to non-volatile data and ease of data sharing between several hosts. One of the features awaited by NVMe users, and cloud providers specifically, is IO determinism, which will increase the quality of service during parallel execution of IOs [10]. By keeping the impact of background maintenance tasks to a minimum and containing the influence of noisy neighbors, the IO determinism feature will give users consistent latency when accessing non-volatile data. An alternative approach is the previously discussed Open Channel architecture [11]. With this second method, the host takes over some of the management functions and only the data travels to the storage hardware. In this configuration, the drive’s physical interface to the host is limited to high-speed data lanes; there are no sideband channels. This example shows the impact and relevance of changes in the NVMe specification and highlights the need for a flexible NVMe hardware infrastructure.

As the new revisions of the base, MI and over Fabric specifications come out in the next few months, NVMe users will benefit from a flexible foundation which can adapt to the new NVMe requirements. The 250 series FPGA and MPSoC products provide this flexibility but also solve today’s customers’ challenges and give them an immediate competitive advantage.

Why FPGAs?

Nallatech’s FPGA and MPSoC products feature the very latest Xilinx UltraScale+ technology and fit the need of a datacenter increasingly focused on NVMe. FPGAs have provided programmable hardware solutions to multiple industries over three decades and are broadly used to solve computing and embedded systems problems in automotive, broadcasting, medical and military markets amongst others. At the same time, in recent years, FPGA manufacturers have introduced the latest and greatest in integrated systems design improvements to this proven technology.

The Xilinx UltraScale+ FPGA and MPSoC products use a 16nm process and improve system performance by providing high-speed fabric, embedded RAM, clocking, and DSP processing. In addition, Xilinx devices have introduced faster transceiver technology (up to 32.75 Gb/s) for higher-throughput connectivity into the network or the PCIe fabric. With their high count of serial transceiver channels, UltraScale+ products can connect to multiple PCIe interfaces at once and provide a data offload interface to a host CPU. In some cases, by replacing a PLX switch with an FPGA or MPSoC, the CPU can offload some of its processing and be freed up for other operations. The programmable logic of the FPGA and MPSoC also provides a deterministic and low-latency interface in a system, which can give a clear competitive advantage in some use cases.

Recent FPGA families now also include embedded low-power microprocessors inside the device fabric. The UltraScale+ MPSoCs match the needs of applications that require software as well as programmable logic by combining them into a single package. For example, the Xilinx Zynq UltraScale+ ZU19EG features two processing units, one quad-core ARM Cortex-A53 and one real-time dual-core ARM Cortex-R5, in addition to a graphics processing unit, an ARM Mali™-400 MP2, for applications with hybrid computing needs. The ZU19EG MPSoC is a very versatile chip, especially well-suited for NVMe over Fabric or Open Channel implementations where the programmable logic provides a low-latency deterministic path for the storage data, and the ARM cores perform complex packet control operations or replace a host CPU in a CPU-less embedded system.

Over the last few years, Nallatech has remained at the forefront of the storage industry and contributed to its innovative growth by developing products based on NVMe technology. Nallatech recognized that FPGAs could reduce I/O bottlenecks and offer a direct high-speed deterministic path to NVMe solid state drives. As early as 2015, Nallatech partnered with Xilinx and IBM to develop an innovative NoSQL database solution [12]. The 250 series FPGA & MPSoC boards build upon the success of this initial product and add features like deeper and faster onboard memory, network connectivity, system on chip and cabling options to server storage backplanes.

250 FPGA & MPSoC Product Series

The 250 FPGA & MPSoC product line comprises three FPGA adapters, the 250S+, 250-U2 and 250-SoC, which connect to a variety of industry-standard form factors like PCIe slots, OCuLink/Nano-Pitch, SlimSAS, MiniSAS HD, U.2 storage backplanes and more. The 250 series products fit right into an existing infrastructure’s PCIe fabric for direct low-latency access to the NVMe storage devices.

250S+ Directly Attached Accelerator

The first accelerator of the series is the 250S+. This FPGA accelerator features a Xilinx Kintex UltraScale+ KU15P FPGA and four onboard four-lane 1TB M.2 NVMe drives (4TB of non-volatile flash in total) in a low-profile, 8-lane, half-height half-length PCIe-compliant form factor. Alternatively, for customers who only want to introduce FPGA computing in their system and already have storage available, the onboard M.2 connectors can cable out to OCuLink/Nano-Pitch or MiniSAS HD NVMe backplanes using Molex low-loss high-speed cabling technology. With 1,143K System Logic Cells, 1,968 DSP Slices and 70.6 Mb of embedded memory, the KU15P FPGA is the largest device of the Kintex UltraScale+ FPGA series and provides a significant amount of configurable resources to implement value-add features. The on-board DDR4 memory bank allows for additional buffering of deeper data vectors.

The Nallatech 250S+ is available in two configurations:
– Up to four M.2 NVMe SSDs coupled on-card to the Xilinx FPGA
– OCuLink break-out cabling allowing the 250S+ to be part of a massively scaled storage array

This compact, high-density storage node provides an all-in-one solution for applications where the host needs to read or write data to NVMe drives at high speed. The onboard FPGA device can efficiently orchestrate and process the streams of data to and from the storage, presenting the drives as one or multiple namespaces or implementing RAID functionality. The 250S+ can be used as a Directly Attached Accelerator (DAA) to virtualize storage, allowing NVMe SSDs to be shared with multiple Virtual Machines and providing a layer of isolation and security between the host CPU and the NVMe SSDs. The FPGA’s programmable logic also provides the option to packetize, compress or encrypt data inline with only a minor impact on the drive access bandwidth and latency; for example, Xilinx’s erasure coding IP introduces a negligible 90ns of latency, far better than a CPU-based implementation. The 250S+ also addresses the checkpoint restart and burst buffer caching use cases, providing an easy caching solution for virtualized and standalone AI and IoT environments.
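As an illustration of presenting several drives as a single namespace, the sketch below shows a RAID-0 style address mapping across four drives. It is a software model of the general striping technique, not Nallatech's FPGA IP; the function name and the stripe size of 32 blocks are assumptions chosen for the example.

```python
from typing import Tuple


def stripe_map(lba: int, num_drives: int = 4, stripe_blocks: int = 32) -> Tuple[int, int]:
    """Map a logical block address to (drive index, block address on that drive).

    Consecutive stripes of `stripe_blocks` blocks rotate across the drives,
    so large sequential transfers are spread over all of them.
    """
    stripe_index, offset = divmod(lba, stripe_blocks)
    drive = stripe_index % num_drives
    drive_lba = (stripe_index // num_drives) * stripe_blocks + offset
    return drive, drive_lba


if __name__ == "__main__":
    for lba in (0, 31, 32, 96, 128, 130):
        drive, drive_lba = stripe_map(lba)
        print(f"logical block {lba:4d} -> drive {drive}, block {drive_lba}")
```

Because neighbouring stripes land on different drives, a sequential stream can in principle draw on the bandwidth of all four M.2 devices at once.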

Directly Attached Accelerator (DAA)
• Virtualize the NVMe storage and share across multiple Virtual Machines
• Isolate the NVMe storage to increase security between the host CPU and the NVMe SSDs
• 250S+ & 250-SoC

250-U2 Proxy In-Line Accelerator

The second member of the 250 series is the 250-U2. This accelerator board features a Xilinx Kintex UltraScale+ KU15P FPGA (the same as the 250S+) and one bank of DDR4 memory in a 2.5” U.2 drive form factor. Unlike the 250S+, the 250-U2 does not have any onboard SSDs directly attached to the FPGA. The novel design of this accelerator allows it to fit into existing U.2 storage backplanes, next to standard U.2 NVMe storage, in systems with no dedicated PCIe slots for additional compute power. The 250-U2 takes on the role of Proxy In-Line Accelerator (PIA).

The 250-U2 can perform inline compression, encryption, and hashing, but also more complex functions such as erasure coding, deduplication, string/image search or database sort/join/filter. Depending on the computing needs of an application, the backplane population would show varying ratios of 250-U2 boards to NVMe drives. The 250-U2 sits in the U.2 backplane alongside the storage and features the same maintenance options as any other standard U.2 NVMe drive, leveraging the NVMe-MI specification. As both the 250-U2 processing node and the storage connect directly to the PCIe fabric of the host server, DMA data traffic can bypass the CPU and global memory entirely for optimized end-point to end-point data transfers using technology like SPDK. With RDMA or peer-to-peer DMA solutions, the data flows directly between NVMe end-points. These direct interfaces into the FPGA and MPSoC programmable logic significantly reduce access latency (Lusinsky, 2017) [21]. Another use case for this hardware platform is as an offload compute engine, which would fit nicely in an FPGAaaS scalable infrastructure.
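To make the inline functions concrete, here is a minimal software model of a stage that hashes, deduplicates and compresses blocks as they stream past. It only illustrates the data transformations; on the 250-U2 these would run in FPGA logic, and the function name, block size and choice of SHA-256 and zlib are assumptions for the example.

```python
import hashlib
import zlib
from typing import Iterable, Iterator, Set, Tuple


def inline_stage(blocks: Iterable[bytes]) -> Iterator[Tuple[str, bytes]]:
    """Yield (SHA-256 digest, compressed payload) for each block not seen before."""
    seen: Set[str] = set()
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()  # hashing
        if digest in seen:                          # deduplication
            continue
        seen.add(digest)
        yield digest, zlib.compress(block)          # compression


if __name__ == "__main__":
    data = [b"A" * 4096, b"B" * 4096, b"A" * 4096]
    stored = list(inline_stage(data))
    print(len(stored), "unique blocks stored out of", len(data))
```

Written as a generator, the stage processes one block at a time, which mirrors the streaming nature of an inline accelerator sitting between the host and the drives.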

Proxy In-Line Accelerator (PIA)
• Perform low-latency, high-bandwidth processing on local NVMe storage data
• Multiple host form factors: 8-lane PCIe adaptor or 2.5” U.2
• 250S+ & 250-U2

250-SoC for NVMe-over-Fabric

The third accelerator of the series, the 250-SoC, features a Xilinx Zynq UltraScale+ ZU19EG MPSoC and can connect to both the network fabric, through two QSFP28 ports (25Gbps line rates for 100GbE support), and the PCIe fabric, through a 16-lane PCIe 3.0 host interface and four 8-lane OCuLink connectors. The ZU19EG is the largest device in its series with 1,143K System Logic Cells, 1,968 DSP Slices and 70.6 Mb of embedded memory. The embedded ARM processing and graphics units in the device package create the ideal platform for a product with hybrid processing requirements.

The hardware versatility of the 250-SoC allows for direct access to storage from the network and supports NVMe-over-Fabric. NVMe-oF is the next-generation NVMe protocol to disaggregate storage over the network fabric and manage storage remotely; NVMe-oF also provides additional flexibility over SAS to set up a network array on demand. Disaggregated storage or EJBOF (Ethernet Just-a-Bunch-Of-Flash) hardware reduces storage cost, footprint and power in the datacenter.

The Xilinx Zynq MPSoC chip offers additional flexibility for embedded systems. The MPSoC board can run an operating system and its full software stack independently from a host CPU. With its high-bandwidth network features supporting up to two 100GbE ports and the onboard MPSoC, the 250-SoC removes the need for both an external Network Interface Card (NIC) and an external processor for NVMe-oF applications [13]. The implementation of an FPGA-based NVMe-oF infrastructure is simple and performant because the data flows only through hardware paths, which gives a low and predictable latency.

NVMe over Fabric (NVMe-oF) Block Diagram

NVMe-over-Fabric (NVMe-oF)
• Low-Latency and High-Throughput of NVMe frames over the datacenter network fabric
• 250-SoC

The 250 series provides a flexible array of solutions for the storage industry. The 250S+ and the 250-SoC tackle the need for virtualization and increased security by targeting the Directly Attached Accelerator use case. The 250-U2 and the 250S+ plug easily into an existing infrastructure as Proxy In-Line Accelerators to offer low-latency and high-bandwidth local data compute for the NVMe storage. Finally, the 250-SoC supports NVMe-over-Fabric as an innovative hardware-only method to disaggregate storage while supporting the latest generation of NVMe protocols. As the NVMe market continues to grow, FPGA and MPSoC solutions will solve the application challenges of NVMe products.

NVMe Applications

NVMe technology has brought disruptive innovation to storage and has a far-reaching impact on the datacenter infrastructure. The features of the protocol make NVMe the number one choice when designing a new product or application involving storage.

Enterprise applications such as database acceleration require low latency as well as high-bandwidth 4K or 8K data write transfer rates, two requirements that fit perfectly with the strengths of the NVMe protocol. These characteristics place NVMe in the lead for implementing a redo log, for example, a use case where many transaction records get stored for future replay if the database fails. For this use case, the 250S+ brings up to 4TB of NVMe storage straight to the edge of the FPGA reconfigurable fabric, where the transaction records are written to the SSDs at high speed, ready for replay [14].
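The redo log pattern itself is simple to sketch: every change is appended to a sequential log before it is considered durable, and after a failure the log is replayed in order to rebuild the state. The example below is a toy model of that idea using an in-memory text stream in place of NVMe storage; the record format and function names are assumptions for illustration and do not reflect any particular database.

```python
import io
import json
from typing import Dict, TextIO


def log_write(log: TextIO, key: str, value: int) -> None:
    """Append one transaction record to the end of the redo log."""
    log.write(json.dumps({"key": key, "value": value}) + "\n")


def replay(log: TextIO) -> Dict[str, int]:
    """Rebuild the key/value state by re-applying every record in order."""
    log.seek(0)
    state: Dict[str, int] = {}
    for line in log:
        record = json.loads(line)
        state[record["key"]] = record["value"]
    return state


if __name__ == "__main__":
    redo_log = io.StringIO()
    log_write(redo_log, "balance:alice", 100)
    log_write(redo_log, "balance:bob", 50)
    log_write(redo_log, "balance:alice", 80)
    print(replay(redo_log))
```

The workload is dominated by small sequential appends, which is why write latency and small-block write bandwidth matter so much for this use case.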

NVMe also alleviates the challenges of virtualized infrastructures and simplifies the implementation of VMs (Virtual Machines), stateless VMs and SR-IOV, where IO is the most common bottleneck. In the stateless VM use case, the IT manager needs locked-down operating system images that corporate users do not modify. Users only modify their data and the OS image remains unchanged in the NVMe storage; privacy and security between users are critical. For such an IT infrastructure, NVMe storage is shared between multiple users. The 250S+ is an all-in-one platform to implement this application. Each 1TB physical drive gets divided by the FPGA IP so that each user gets segregated and secure access to their OS image and data. The hypervisor manages direct access to each fraction of the drive without the need for an emulation driver, which provides better performance for this IO-bound application.
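The drive division described above can be pictured as a table of non-overlapping block ranges, one per user, with every access translated and bounds-checked. The sketch below is a simplified software model of that idea, not the FPGA IP; the `Slice` class, the `carve` helper and the equal-sized split are assumptions for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Slice:
    """A contiguous range of blocks on the physical drive owned by one user."""
    user: str
    first_lba: int
    num_blocks: int

    def translate(self, user_lba: int) -> int:
        """Map a user-visible block address to a physical one, enforcing isolation."""
        if not 0 <= user_lba < self.num_blocks:
            raise ValueError(f"{self.user}: block {user_lba} is outside the slice")
        return self.first_lba + user_lba


def carve(total_blocks: int, users: List[str]) -> Dict[str, Slice]:
    """Split a drive into equal, non-overlapping slices, one per user."""
    per_user = total_blocks // len(users)
    return {u: Slice(u, i * per_user, per_user) for i, u in enumerate(users)}


if __name__ == "__main__":
    slices = carve(total_blocks=1000, users=["vm0", "vm1", "vm2", "vm3"])
    print(slices["vm1"].translate(0))    # first block of vm1's slice
    print(slices["vm3"].translate(249))  # last block of vm3's slice
```

Because each user only ever sees addresses relative to its own slice, a request cannot reach another user's OS image or data.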

The “Big Data” market also brings opportunities for intelligent NVMe products which combine storage and processing, since it is moving away from batch processing towards real-time processing. Map-reduce problems are moving towards real-time analytics and therefore need a new tier of storage which is much faster than the GFS backend. The storage tiering now seen in IT infrastructures ranges from rarely accessed, low-speed cold storage to very fast SSDs, NVMe or NVM memories. In this use case, all the data gets recorded in the GDFS but is then moved to a compute node with faster memory. The 250-SoC implementing NVMe-over-Fabric answers both these requirements as it gives access to high-speed storage and high-performance compute capabilities.

The deep learning industry has similar needs to the analytics world. The new generation of accelerators for deep learning (GPGPUs, TPUs and FPGAs) needs large memory bandwidth to match the chips’ compute abilities. Training operations consume a lot of this high-throughput data, often multiple terabytes [15]. Recent research efforts show that the FPGA fabric can accelerate training operations for certain network types. Therefore, combining both the storage and the compute engine on one hardware platform reduces latency, allowing for more retraining cycles as the training dataset grows [16].

In the HPC space, the local storage of the 250S+ and its remote counterpart, the 250-SoC, have several applications such as checkpoint/restart, burst buffer, distributed filesystems or caching the job data from a scheduler. By running the algorithm close to the storage on the FPGA fabric, the footprint of the FPGA application remains low, while utilizing the storage fully and keeping the CPU free for other processing jobs. Another example is in-memory databases, in which gigabytes of data are held in volatile memory but need to be backed up into flash on a regular basis. Instead of simply storing the data or using the host CPU to compress or encrypt it, an FPGA-based system can process these snapshots of data for permanent storage in large NVMe-based storage arrays. For this type of operation, the MPSoC is particularly well-suited to perform more complex operations on the user data.

Finally, in the IoT space, there is a need for data filtering and preprocessing on IoT gateways, where aggregation takes place, as well as encryption of data after it has been received. The FPGA processes streams of data in real time with bit-level operations like encryption or compression, then either stores the data on-board using the 250S+ or passes it to the storage backplane at the input bandwidth using the cabled 250S+ or the 250-SoC. FPGAs are also the platform of choice for blockchain calculations. Blockchain technology differentiates IoT gateways by providing an adaptive and secure method to maintain the user privacy preferences of IoT devices [17].

Nallatech’s Capabilities

For over twenty years, Nallatech has helped industry specialists introduce FPGAs in their infrastructure to design, develop and optimize workloads. During that time, Nallatech Compute and Network solutions have provided a competitive advantage to customers in various industries including HPC, Finance, Genomics and Embedded Computing. Nallatech combines hardware, software, and system design expertise to guide customers looking to maximize the benefits of FPGA technologies in their products.

In the 250 accelerator series, Nallatech has selected a variety of Xilinx UltraScale+ devices and PCIe form factors to provide a complete offering for storage infrastructure architects. These accelerators connect the programmable logic of the Xilinx devices directly into the infrastructure network and PCIe fabric through latest-generation 100GbE and PCIe 3.0 high-speed interfaces. Additionally, using the capabilities of Nallatech’s parent company, Molex, the 250 series offers high flexibility to connect to existing hardware. Molex is an industry leader in ultra-high-speed, low-loss cables and interconnect solutions.


NVMe has transformed, and is still transforming, the storage industry at a rapid pace. This new high-throughput storage technology provides a flexible storage solution for IT infrastructures. NVMe not only provides superior data write and read bandwidth compared to previous-generation storage, it also leverages the current PCIe and network fabric of existing datacenters. As NVMe becomes more popular, industry innovators are launching new products which support NVMe. All of the basic datacenter equipment is being updated to support NVMe; NVMe storage backplanes are now the norm.

FPGA-based products for NVMe allow the compute to merge with the storage at the hardware level to reach higher application performance. With FPGAs, the processing power of reconfigurable logic is directly attached to the storage through a high-throughput and low-latency pipe. Because of these characteristics, data can flow through the FPGA and be processed in real time. Additionally, by using FPGA processing, the CPU cores become free to perform other tasks that can only run on the processor. MPSoCs bring additional capabilities to the system by combining high-speed data processing and control on a single device which can potentially run autonomously.

Nallatech FPGA and MPSoC-based storage products have been designed to target the needs of real applications and solve the challenges of IT infrastructure managers. Nallatech offers a path to production with the 250-product series. For more information, please visit www.nallatech.com/storage


1. McDowell S. (2018). Storage Industry 2018: Predictions For The Year To Come. Forbes. Retrieved June 4, 2018, from: https://www.forbes.com/sites/moorinsights/2018/01/24/storage-industry-2018-predictions-for-the-year-to-come
2. Ahmad M. (2017). Four trends to watch in NVMe-based storage designs. Electronic Designs. Retrieved June 8, 2018, from: https://www.electronicproducts.com/Computer_Peripherals/Storage/Four_trends_to_watch_in_NVMe_based_storage_designs.aspx
3. G2M Research (2018). G2M Research NVMe Ecosystem Market Sizing Report. G2M Research. Retrieved June 6, 2018, from: http://g2minc.com/g2m-research-nvme-ecosystem-market-sizing-report
4. Mehta N. (2015). Pushing Performance and Integration with the UltraScale+ Portfolio. Xilinx. Retrieved June 8, 2018, from: https://www.xilinx.com/support/documentation/white_papers/wp471-ultrascale-plus-perf.pdf
5. Allen D., & Metz J. (2018a). The Evolution and Future of NVMe. Bright Talk. Retrieved from: https://www.brighttalk.com/webcast/12367/290529
6. Nuncic (2017). More Speed for your SSD – NVME Expected to Replace SATA and SAS in the Future. OnTrack. Retrieved June 8, 2018, from: https://www.ontrack.com/blog/2017/09/15/nvme-replace-sata-sas/
7. Adshead A. (2017). Storage briefing: NVMe vs SATA and SAS. Computer Weekly. Retrieved June 8, 2018, from: https://www.computerweekly.com/feature/Storage-briefing-NVMe-vs-SATA-and-SAS
8. Rollins D. (2017). The Business Case for NVMe PCIe SSDs. Micron website. Retrieved from: https://www.micron.com/about/blogs/2017/july/the-business-case-for-nvme-pcie-ssds
9. Allen D., & Metz J. (2018b). On the Horizon for NVMe Technology: Q&A on the Evolution and Future of NVMe Webcast. NVM Express. Retrieved from: https://nvmexpress.org/on-the-horizon-for-nvme-technology-qa-on-the-evolution-and-future-of-nvme-webcast/
10. Maharan P. (2018). A Review of NVMe Optional Features for Cloud SSD Customization. Seagate Blog. Retrieved from: https://blog.seagate.com/intelligent/a-review-of-nvme-optional-features-for-cloud-ssd-customization/
11. Martin B. (2017). I/O Determinism and Its Impact on Datacenters and Hyperscale Applications. Flash Memory Summit 2017. Retrieved from: https://www.flashmemorysummit.com/English/Collaterals/Proceedings/2017/20170808_FB11_Martin.pdf
12. Leibso S. (2016). IBM and Nallatech demo CAPI Flash at OpenPOWER Summit in San Jose. Xcell Daily Blog. Retrieved June 4, 2018, from: https://forums.xilinx.com/t5/Xcell-Daily-Blog/IBM-and-Nallatech-demo-CAPI-Flash-at-OpenPOWER-Summit-in-San/ba-p/691256
13. Sakalley D. (2017). Using FPGAs to accelerate NVMe-oF based Storage Networks. Flash Memory Summit. Retrieved June 7, 2018, from: https://www.flashmemorysummit.com/English/Collaterals/Proceedings/2017/20170810_FW32_Sakalley.pdf
14. Rollins J. D. (n.d.). Redo Log Files and Backups. Wake Forest University. Retrieved from: http://users.wfu.edu/rollins/oracle/archive.html
15. Wahl M., Hartl D., Lee W., Zhu X., Menezes E., & Tok W. H. (2018). How to Use FPGAs for Deep Learning Inference to Perform Land Cover Mapping on Terabytes of Aerial Images. Microsoft Blog.
16. Teich D. (2018). Management AI: GPU and FPGA, Why They Are Important for Artificial Intelligence. Forbes. Retrieved from: https://www.forbes.com/sites/davidteich/2018/06/15/management-ai-gpu-and-fpga-why-they-are-important-for-artificial-intelligence/#6bf2ff171599
17. Cha S. C., Chen J. F., Su C., & Yeh K. H. (2018). Blockchain Connected Gateway for BLE-Based Devices in the Internet of Things. IEEE Access. Retrieved from: https://ieeexplore.ieee.org/document/8274964/
18. Alcorn (2017). Hot Chips 2017: We’ll See PCIe 4.0 This Year, PCIe 5.0 In 2019. Tom’s Hardware. Retrieved June 8, 2018, from: https://www.tomshardware.com/news/pcie-4.0-5.0-pci-sig-specfication,35325.html
19. Caulfield L. (2018). Project Denali to define flexible SSDs for cloud-scale applications. Azure Microsoft. Retrieved June 6, 2018, from: https://azure.microsoft.com/en-us/blog/project-denali-to-define-flexible-ssds-for-cloud-scale-applications/
20. Ismail N. (2017). Flash storage: transforming the storage industry. Information Age. Retrieved June 4, 2018, from: http://www.information-age.com/flash-storage-transforming-storage-industry-123465174/
21. Lusinsky R. (2017). 11 Myths about RDMA over Converged Ethernet (RoCE). Electronic Design. Retrieved June 9, 2018, from: http://www.electronicdesign.com/industrial-automation/11-myths-about-rdma-over-converged-ethernet-roce
22. Miller R. (2017). IBM’s new Power9 chip was built for AI and machine learning. Tech Crunch. Retrieved June 8, 2018, from: https://techcrunch.com/2017/12/05/ibms-new-power9-chip-architected-for-ai-and-machine-learning/
23. Peng V. (2015). 16nm UltraScale+ Series by Victor Peng, EVP & GM. Xilinx. Retrieved June 8, 2018, from: https://www.xilinx.com/video/fpga/16nm-ultrascale-plus-series.html
24. Vaid K. (2018). Microsoft creates industry standards for datacenter hardware storage and security. Azure Blog. Retrieved from: https://azure.microsoft.com/en-us/blog/microsoft-creates-industry-standards-for-datacenter-hardware-storage-and-security/
25. Retrieved from: https://blogs.technet.microsoft.com/machinelearning/2018/05/29/how-to-use-fpgas-for-deep-learning-inference-to-perform-land-cover-mapping-on-terabytes-of-aerial-images/

FPGA-Accelerated NVMe Storage Solutions (posted 2018-08-30T12:00:10+00:00)

NVMe Storage Acceleration Solutions at Flash Memory Summit

Flash Memory Summit 2018
Nallatech Showcases 250 Series of NVMe Storage Acceleration Solutions at Flash Memory Summit

CAMARILLO, California – August 6th, 2018 – Nallatech, a Molex Company, a leading supplier of high-performance FPGA solutions, demonstrates the capabilities of the 250 series of Accelerated NVMe Storage Solutions at the annual Flash Memory Summit, booth 844, Santa Clara Convention Center, CA.

On display at the event, the Nallatech 250 series comprises three core products, all of which are now shipping to lead customers and ecosystem companies. These innovative products adhere to PCIe and U.2 form factors, allowing them to be easily integrated into data center infrastructure.

“The 250S+ is a NIC-sized near-storage accelerator featuring a Xilinx UltraScale+ FPGA. PCIe Gen 4-capable, the 250S+ can be added to PCIe or CAPI-enabled server platforms for applications including database acceleration, in-line compression/encryption, checkpoint restarting and burst buffer caching,” said Craig Petrie, vice president business development of FPGA solutions at Nallatech. “Customers can leverage four M.2 NVMe SSDs coupled on-card to the FPGA, or instead break-out via OCuLink cables to allow the 250S+ to be part of a massively scaled storage array.”

“We’re pleased that Nallatech has selected the Xilinx Zynq UltraScale+ MPSoC as the processing core of the 250-SoC,” said Manish Muthal, vice president of data center business at Xilinx. “Packaged in this way, the Zynq MPSoC allows customers to create remote, disaggregated storage solutions. Available as a turnkey solution featuring Xilinx’s NVMe-over-Fabric IP, the 250-SoC provides reliable transport of NVMe frames with low latency, high throughput, and massive scalability to remote hosts.”

The third of Nallatech’s NVMe accelerator products is the 250-U2, a fully programmable accelerator featuring a Xilinx Kintex UltraScale+ FPGA and local DDR4 SDRAM memory. “Eideticom is excited to be collaborating with Nallatech on our NoLoad™ storage and analytics FPGA acceleration platform,” said Roger Bertschmann, president and co-founder of Eideticom. “Nallatech’s 250-U2 near-storage accelerator with U.2 industry standard form factor coupled with NoLoad’s NVM Express (NVMe) compatible interface provides a compelling solution for storage vendors. NoLoad’s high-throughput accelerator instances such as compression, erasure coding and search enable storage OEMs to efficiently process multi-gigabytes of data with unmatched performance.”

Please visit www.nallatech.com/storage for additional information

About Nallatech
Nallatech, a Molex company, is a leading supplier of accelerated computing solutions. Nallatech has deployed several of the world’s largest FPGA hybrid computer clusters and is focused on delivering scalable solutions that deliver high performance per watt, per dollar. www.nallatech.com

About Molex
Molex brings together innovation and technology to deliver electronic solutions to customers worldwide. With a presence in more than 40 countries, Molex offers a full suite of solutions and services for many markets, including data communications, consumer electronics, industrial, automotive, commercial vehicle and medical. www.molex.com

About Eideticom

Eideticom develops leading edge storage, compute and application acceleration products targeting programmable platforms on the cloud or at the network edge. Eideticom’s extensive experience developing and deploying enterprise grade products enables our customers and partners to confidently and successfully deliver their products to market. www.eideticom.com

NVMe Storage Acceleration Solutions at Flash Memory Summit – 2018-08-06T10:44:51+00:00

FPGA Acceleration of Binary Neural Networks

BNN Binary Neural Networks

DOWNLOAD WHITEPAPER: FPGA Accelerated Binary Neural Network

Deep Learning

Until only a decade ago, Artificial Intelligence resided almost exclusively within the realm of academia, research institutes and science fiction. The relatively recent realization that Deep Learning techniques could be applied practically and economically, at scale, to solve real-world application problems has resulted in a vibrant eco-system of market players.

Now, almost every application area is in some way benefiting from Deep Learning – the leveraging of Artificial Neural Networks to learn from vast volumes of data to efficiently execute specific functions. From this field of neural network research and innovation, Convolutional Neural Networks (CNNs) have emerged as a popular deep learning technique for solving image classification and object recognition problems. CNNs exploit spatial correlations within the image sets by using convolution operations. CNNs are generally regarded as the neural network of choice – especially for low-power applications because they have fewer weights and are easier to train compared to fully connected networks which demand more resources.

Neural Networks

One approach to reducing the amount of silicon, and therefore the power, required to execute a high-performance neural network is to reduce the dynamic range of floating-point calculations. Using 16-bit floating-point arithmetic instead of 32-bit has been shown to only slightly impact the accuracy of image classification. Furthermore, depending upon the network, the accuracy of the calculation can be reduced even further to fixed point or even single bits. This trend of improving overall efficiency by reducing calculation accuracy has led to the use of binary weights, i.e. weights and input activations that are binarized with only two values: +1 and -1. This new variant is known as a Binary Neural Network (BNN). It reduces all fixed-point multiplication operations in the convolutional layers and fully connected layers to 1-bit XNOR operations.

Flexible FPGAs

Established classes of conventional computing technologies have attempted to evolve at pace to cater for this dynamic market. NVIDIA, for instance, has not only adapted the underlying GPU architecture and tools, but also their product strategy and value proposition. GP-GPUs, previously marketed as the ultimate double precision floating-point engines for graphics and demanding HPC applications are now being re-positioned for the Deep Learning CNN market where half-precision arithmetic support is critical for success.

Google, one of the strongest proponents of AI, has created its own dedicated hardware architecture, the Tensor Processing Unit (TPU), which is tightly coupled with their Machine Learning framework, TensorFlow. Other industry leaders, including hyperscale innovator Microsoft, have selected Field Programmable Gate Arrays (FPGAs) for their “Brainwave” AI architecture – a pipeline of persistent neural networks that promises to deliver real-time results. This choice is no doubt linked to the confidence they gained from the highly successful (and market disrupting) use of Intel-based Arria-10 FPGAs for Bing search indexing.

This white paper explains why FPGAs are uniquely positioned to address the dynamic roadmap requirements of neural networks of all bit ranges – in particular, BNNs.

Binary Neural Networks

Processing convolutions within CNN networks requires many millions of coefficients to be stored and processed. Traditionally, each of these coefficients is stored in a full single precision representation. Research has demonstrated that coefficients can be reduced to half precision without any material change to the overall accuracy while reducing storage capacity and memory bandwidth. More significantly, this approach also shortens the training and inference time. Most of the pre-trained CNN models available today use partially reduced precision.

Figure 1 : Converting weights to binary (mean = 0.12)

By using a different approach to the training of these coefficients, the bit accuracy can be reduced to a single bit plus a scaling factor¹. During training, the floating-point coefficients are converted to binarized values and a scaling factor: all coefficients of an output feature are averaged, and this average is subtracted from each original value to produce a result that is either positive or negative, represented as 1 or 0 respectively in binary notation (Figure 1). The output of the convolution is then multiplied by the mean.
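The binarization step described above can be sketched in a few lines of Python. This is a minimal illustration, not Nallatech's training code: the function name `binarize_weights` is hypothetical, and because the text does not say exactly which mean is used to scale the output, the sketch assumes the scaling factor is the mean absolute value of the centred weights.

```python
from statistics import mean


def binarize_weights(weights):
    """Binarize the weights of one output feature.

    Returns (bits, centre, scale), where bits[i] is 1 if the centred
    weight is non-negative and 0 if it is negative.
    """
    centre = mean(weights)                      # average of all coefficients
    centred = [w - centre for w in weights]     # subtract the average
    bits = [1 if c >= 0 else 0 for c in centred]
    # ASSUMPTION: the scale is the mean magnitude of the centred weights.
    scale = mean(abs(c) for c in centred)
    return bits, centre, scale


bits, centre, scale = binarize_weights([0.5, -0.3, 0.1, -0.1])
print(bits, round(centre, 3), round(scale, 3))
```

Each weight is then stored as a single bit, and one floating-point scale per output feature is applied after the convolution.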

FPGA Optimizations

Firstly, binarization of the weights reduces the external memory bandwidth and storage requirements by a factor of 32. The FPGA fabric can take advantage of this binarization as each internal memory block can be configured to have a port width ranging from 1 to 32 bits. Hence, the internal FPGA resource for storage of weights is significantly reduced, providing more space for parallelization of tasks.
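As a rough illustration of the factor of 32, the sketch below packs binary weights into 32-bit words and compares the storage with single precision. The helper name `pack_bits` and the bit ordering (first weight in the least significant bit) are assumptions made for this example only.

```python
def pack_bits(bits):
    """Pack a list of 0/1 weights into 32-bit words, first weight in bit 0."""
    words = []
    for start in range(0, len(bits), 32):
        word = 0
        for offset, bit in enumerate(bits[start:start + 32]):
            word |= (bit & 1) << offset
        words.append(word)
    return words


n_weights = 1024
float_bytes = n_weights * 4                      # 32 bits per single-precision weight
packed_words = pack_bits([1, 0] * (n_weights // 2))
packed_bytes = len(packed_words) * 4             # 32 binary weights per word
print(float_bytes // packed_bytes)
```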

The binarization of the network also allows the CNN convolutions to be represented as a series of additions or subtractions of the input activations. If the weight is binary 0, the input is subtracted from the result; if the weight is binary 1, it is added to the result. Each logic element in an FPGA has addition carry chain logic that can efficiently perform integer additions of virtually any bit length. Utilizing these components efficiently allows a single FPGA device to perform tens of thousands of parallel additions. To do so, the floating-point input activations must be converted to fixed precision. Given the flexibility of the FPGA fabric, we can tune the number of bits used by the fixed additions to meet the requirement of the CNN. Analysis of the dynamic range of activations in various CNNs shows that only a handful of bits, typically 8, are required to maintain an accuracy to within 1% of a floating-point equivalent design. The number of bits can be increased if more accuracy is required.
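The add-or-subtract accumulation can be modelled in software as follows. This is only a behavioural sketch of what the FPGA carry chains compute: the function names are hypothetical, and the split of the 8 bits into 4 fractional bits is an assumption chosen for the example.

```python
def quantize(activations, frac_bits=4, total_bits=8):
    """Convert floats to signed fixed-point integers, saturating at the limits."""
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    return [max(lo, min(hi, round(a * (1 << frac_bits)))) for a in activations]


def binary_accumulate(acts_fixed, weight_bits):
    """Add the activation where the weight bit is 1, subtract where it is 0."""
    total = 0
    for act, bit in zip(acts_fixed, weight_bits):
        total += act if bit else -act
    return total


acts = quantize([0.5, 1.25, -0.75, 2.0])           # -> [8, 20, -12, 32]
acc = binary_accumulate(acts, [1, 0, 1, 1])        # 8 - 20 - 12 + 32
print(acc / (1 << 4))                              # back to a real value
```

No multiplier is involved: the weight bit only selects the sign of each integer addition.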

Converting to fixed point for the convolution and removing the need for multiplications via binarization dramatically reduces the logic resources required within the FPGA. It is then possible to perform significantly more processing in the same FPGA compared to a single precision or half precision implementation.

Deep Learning models are becoming deeper by adding more and more convolution layers. Having the capability to stack all these layers into a single FPGA device is critical to achieving the best performance per watt for a given cost while retaining the lowest possible latency.

FPGA Implementation

The Intel FPGA OpenCL framework was used to create the CNNs described in this paper. To optimize the design further, the Nallatech research center developed IP libraries for the binary convolution and other bit manipulation operations. This provides a powerful mix of programmability and efficiency.

Table 1: Approximate Yolo V3 layers


The network targeted for this white paper was the Yolo v3 network (Table 1). This network consists largely of convolution layers and therefore the FPGA has been optimized to be as efficient at convolutions as possible.

To achieve this, the design uses an HDL block of code to perform the integer accumulations required for binary networks, making for an extremely efficient implementation.

Table 2 : Resource requirements of BNN IP (% Arria 10 GX 1150)


Table 2 lists the resource requirements for the accumulation of the 8-bit activation data when using binary weights. This is equivalent to 2048 floating-point operations, but requires only 2% of the device. Note that the FPGA needs extra resources to restructure the data so it can be processed this way (see Table 3); nevertheless, the figures illustrate the dramatic reduction in resources that can be achieved versus a floating-point implementation.

The FPGA is also required to process the other layers of Yolo v3 to minimize the data copied over the PCIe interface. These layers require much less processing and therefore less of the FPGA resource is allocated to these tasks. In order for the network to train correctly, it was necessary for activation layers to be processed with single precision accuracy. Therefore, all layers other than the convolution are calculated at single precision accuracy.

The final convolution layer is also calculated in single precision to improve training and is processed on the host CPU. Table 3 details the resources required by the OpenCL kernels including all conversions from float to 8-bit inputs, the scaling of the output data and final floating-point accumulation.

Table 3 : Resource requirements for full Yolo v3 CNN kernel (% Arria 10 GX 1150)

FPGA Accelerator Platforms

The FPGA device targeted in this whitepaper is an Intel-based Arria-10. It is a mid-range FPGA fully supported within the Intel OpenCL Software Development Kit (SDK). Nallatech delivers this flexible, energy-efficient accelerator in the form of either an add-in PCIe card or integrated rackmount server. Applications developed in OpenCL are mapped onto the FPGA fabric using Nallatech’s Board Support Package (BSP) enabling customers (predominantly software rather than hardware focused) to remain at a higher level of abstraction than is typically the case with FPGA technology.

Nallatech’s flagship “520” accelerator card shown below features Intel’s new Stratix-10 FPGA. It is a PCIe add-in card compatible with server platforms supporting GPU-class accelerators, and is ideal for scaling Deep Learning platforms cost effectively.


Each convolution block performs 2048 operations per clock cycle, or ~0.5 TOPS for a typical Arria 10 device. Four such kernels allow Yolo v3 to run at ~8 frames per second for a power consumption of 35 Watts. This is equivalent to 57 GOPS/Watt.
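The arithmetic behind these figures can be checked with a short calculation. The clock frequency is not stated in the text; the value below is simply what the quoted 2048 operations per clock and ~0.5 TOPS imply, and the variable names are chosen for this example.

```python
OPS_PER_CLOCK = 2048      # operations per clock cycle, per convolution block
BLOCK_TOPS = 0.5          # quoted throughput of one block
KERNELS = 4               # number of convolution kernels used for Yolo v3
POWER_W = 35              # quoted power consumption in watts

# Clock rate implied by the two quoted per-block figures (an inference).
implied_clock_mhz = BLOCK_TOPS * 1e12 / OPS_PER_CLOCK / 1e6

total_gops = KERNELS * BLOCK_TOPS * 1000
gops_per_watt = total_gops / POWER_W

print(f"implied clock: {implied_clock_mhz:.0f} MHz")
print(f"total: {total_gops:.0f} GOPS, efficiency: {gops_per_watt:.0f} GOPS/W")
```

Four blocks at ~0.5 TOPS each give ~2 TOPS, and 2000 GOPS divided by 35 W is roughly 57 GOPS/W, which matches the quoted efficiency.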

XNOR Networks

It is possible to further reduce compute and storage requirements of CNNs by moving to a full XNOR network. Here both the weights and activations are represented as binary inputs. In this case a convolution is represented as a simple bitwise XNOR calculation, plus some bit counting logic. This is equivalent to the binary version described earlier except that activations are now only a single bit wide.
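The XNOR-plus-bit-counting formulation can be demonstrated on integers used as bit vectors. In this sketch a set bit stands for +1 and a clear bit for -1; the function name `xnor_dot` is hypothetical, and the code illustrates the arithmetic identity rather than any FPGA implementation.

```python
def xnor_dot(a_bits, w_bits, n):
    """Dot product of two +1/-1 vectors stored as n-bit integers.

    Bit value 1 encodes +1 and bit value 0 encodes -1.
    """
    mask = (1 << n) - 1
    agree = ~(a_bits ^ w_bits) & mask       # XNOR: 1 where the signs match
    matches = bin(agree).count("1")         # bit counting (popcount)
    return 2 * matches - n                  # matches minus mismatches


def reference_dot(a_bits, w_bits, n):
    """Same result computed with explicit +1/-1 multiplications."""
    to_pm1 = lambda word, i: 1 if (word >> i) & 1 else -1
    return sum(to_pm1(a_bits, i) * to_pm1(w_bits, i) for i in range(n))


print(xnor_dot(0b1011, 0b1101, 4), reference_dot(0b1011, 0b1101, 4))
```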

Speed-up of such networks is estimated at two orders of magnitude when running on FPGA. This disruptive performance improvement enables multiple real-time inferences to run in parallel on power-efficient devices. XNOR networks require a different approach to training, where activations on the forward pass are converted to a binary value and a scaling factor.

Whereas binary networks show little degradation in accuracy, XNOR networks show a 10-20%² difference to a floating-point equivalent. However, this is using CNNs not designed specifically for XNOR calculations. As research into this area increases, it’s likely the industry will see new models designed with XNOR networks in mind that will provide a level of accuracy close to the best CNNs, while benefiting from the tremendous efficiency of this new approach.


This whitepaper has demonstrated that significant bit reductions can be achieved without adversely impacting the quality of application results. Binary Neural Networks (BNNs), a natural fit for the properties of the FPGA, can be up to thirty times smaller than classic CNNs – delivering a range of benefits including reductions in silicon usage, memory bandwidth, power consumption and clock speed.

Given their recognized strength for efficiently implementing fixed point computations, FPGAs are uniquely positioned to address the needs of BNNs. The inherent architecture flexibility of the FPGA empowers Deep Learning innovators and offers a fast-track deployment option for any new disruptive techniques that emerge. XNOR networks are predicted to deliver major improvements in image recognition for a range of cloud, edge and embedded applications.

Nallatech, a Molex company, has over 25 years of FPGA expertise and is recognized as the market leader in FPGA platforms and tools. Nallatech’s complimentary design services allow customers to successfully port, optimize, benchmark and deploy FPGA-based Deep Learning solutions cost-effectively and with minimal risk.

Please visit www.nallatech.com or email contact@nallatech.com for further information.

This work has been partly developed as part of the OPERA project to provide offloading support for low powered traffic monitoring systems: www.operaproject.eu


FPGA Acceleration of Binary Neural Networks – 2018-06-25T15:54:26+00:00

Molex Acquires BittWare, Inc.

Molex acquires Bittware 2018

May 14, 2018, Lisle, IL – Today, Molex acquired BittWare, Inc., a move that expands our capabilities in high-performance computing solutions.

BittWare designs and manufactures board-level solutions for high-end FPGA applications, signal processing, and network processing. Field-programmable gate arrays (FPGAs) are important for machine learning, artificial intelligence, IoT and other applications that require high-speed data transmission.

BittWare is based in Concord, New Hampshire, with approximately 45 employees. It will be managed by Molex’s ISI business within the DataCom & Specialty Solutions Division. The acquisition expands on the capabilities of ISI’s Nallatech and Innovative Integration product groups to address the rising demand for FPGA-based solutions.

Acquisitions like BittWare are an important part of Molex’s Vision to provide customers with innovative electronic solutions. BittWare is known within the industry for its wide breadth of in-house FPGA board, subsystem and software expertise. It has formed strong commercial relationships with FPGA market leaders Intel and Xilinx.

Molex’s expertise in high-speed datacom products, customer base, and global resources, combined with BittWare’s capabilities, will help both companies capitalize on the growth of the FPGA industry and become the most capable supplier of FPGA computing platforms.

We look forward to a strong future together and welcome BittWare and its employees to Molex!


Molex Acquires BittWare, Inc. – 2018-09-04T07:49:35+00:00

Nallatech Launches 250 Series of NVMe Acceleration Solutions

Nallatech Launches 250 Series of NVMe Storage Acceleration Solutions

250 Series FPGA Products

CAMARILLO, California – March 19, 2018 – Nallatech, a Molex Company, a leading supplier of high-performance FPGA solutions, announces availability of the 250 family of Accelerated Storage Solutions featuring Xilinx UltraScale+ FPGA and MPSoC technology.

“FPGAs are being deployed across a range of on-premise storage platforms and cloud infrastructure to achieve a step-change in application performance and energy-efficiency,” said Craig Petrie, vice president business development of FPGA solutions at Nallatech. “Our collaboration with Xilinx has delivered a family of innovative storage products capable of accelerating common functions such as erasure coding, deduplication, encryption and compression.  These products adhere to PCIe and U.2 form factors allowing them to be easily integrated into data center infrastructure.”

“We’re pleased that Xilinx UltraScale+ FPGAs and MPSoCs are at the core of Nallatech’s new family of accelerated storage products,” said Manish Muthal, vice president of data center business at Xilinx. “Packaging disruptive technology in this way allows customers to easily and rapidly deploy Xilinx solutions, and to take advantage of the dramatic benefits of Xilinx technology in a cost-effective manner.”

The Nallatech 250 series comprises three core products:

250S+ — A fully-programmable NIC-sized near-storage accelerator featuring a Xilinx Kintex UltraScale+ FPGA. This PCIe Gen 4-capable accelerator card can be added to PCIe or CAPI-enabled server platforms introducing an energy-efficient acceleration capability for applications including database acceleration, in-line compression/encryption, checkpoint restarting and burst buffer caching. The 250S+ is available with a choice of two configurations. The first provides up to four M.2 NVMe SSDs coupled on-card to the Xilinx FPGA. The second offers an innovative break-out option using OCuLink cabling to allow the 250S+ to be part of a massively scaled storage array.

250-U2 — Adhering to the U.2 form factor, this fully-programmable accelerator features a Xilinx Kintex UltraScale+ FPGA and local DDR4 SDRAM memory. This energy-efficient, flexible compute node is intended to be deployed within conventional U.2 NVMe storage arrays (approximately 1:8 ratio) allowing FPGA-accelerated instances of erasure coding, deduplication and compression to boost overall system performance. The 250-U2 is available as a fully-programmable device for customers preferring to develop and deploy their own application codes.

250-SoC — The 250-SoC enables the creation of remote, disaggregated storage or Ethernet Just-a-Bunch-of-Flash (EJBOF) which dramatically reduces the storage cost, footprint and power within data centers. A Xilinx Zynq UltraScale+ MPSoC device featuring both FPGA fabric and 64-bit ARM processors coordinates data transfer between two 100GbE network ports, onboard DDR4 memory and a PCIe Gen 4 host interface. Optional OCuLink ports allow the NIC-sized accelerator to be part of a massively scaled storage array. The 250-SoC is available either fully-programmable or as a pre-programmed solution featuring Xilinx’s NVMe-over-Fabric IP. This optimized design implements the NVM Express-over-Fabrics protocol offload and RDMA NIC protocol. This turnkey solution provides reliable transport of NVMe frames with low latency, high throughput, and massive scalability to remote hosts.

Please visit www.nallatech.com/storage for additional information.


Nallatech Launches 250 Series of NVMe Acceleration Solutions – 2018-03-19T08:50:49+00:00

High Frequency Trading – Get the competitive edge with Nallatech FPGAs


Nallatech’s Craig Petrie explains how financial trading can gain the competitive edge with the latest Intel Stratix 10 FPGA technology.


High Frequency Trading – Get the competitive edge with Nallatech FPGAs – 2018-02-27T08:57:54+00:00

OPERA Project – Improving computational energy efficiency

OPERA Project – Improving computational energy efficiency through low power consumption systems

OPERA Project – LOw Power Heterogeneous Architecture for NExt generation of SmaRt infrastructure and Platforms in Industrial and Societal Applications. The OPERA project is co-funded by the European Union’s HORIZON 2020 Framework Programme for Research and Innovation. It targets a new generation of low power consumption systems that improve computational energy efficiency through the development of heterogeneous architectures, distributing the workload according to applications and server technology.


OPERA Project – Improving computational energy efficiency – 2018-04-11T06:52:49+00:00

Any Rate, Any Format: Accelerating Kafka Producers with FPGAs


Nallatech Whitepaper – Accelerating Kafka Producers with FPGAs

Introduction – Accelerate Kafka Producers with FPGAs

Apache Kafka is at the heart of the emerging universal streaming data pipeline. Kafka has many high-profile adopters as the streaming platform of choice, including LinkedIn, Netflix, Uber and ING. At LinkedIn, approximately two trillion messages per day pass through Kafka. According to TechRepublic.com, six of the top 10 travel companies, seven of the top 10 global banks, eight of the top 10 insurance companies and nine of the top 10 telecom companies have adopted Kafka as the central platform for managing streaming data. At the 2017 New York Kafka Summit, Confluent reported that over one third of the Fortune 500 have deployed Kafka.

Basic Kafka System

Kafka has three essential components – producers, brokers and consumers. Producers publish data to topics on brokers and consumers subscribe to topics. Figure 1 shows a basic Kafka system.

Figure 1 - Basic Kafka System


One of the many advantages of the Kafka architecture is the decoupling of producers and consumers. Producers and consumers can run at wildly different data rates and yet have no effect on each other. The other key advantage of Kafka is its small size. With just over 90,000 lines of code, Kafka clusters can be implemented on much more modest hardware than Spark Streaming, which requires a full Spark node.
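The decoupling of producers and consumers can be illustrated with a toy in-memory model. This sketch does not use the Kafka client API; `ToyBroker` and `ToyConsumer` are hypothetical classes that only mimic the idea of an append-only topic log read at each consumer's own offset.

```python
from collections import defaultdict


class ToyBroker:
    """Holds one append-only log per topic."""

    def __init__(self):
        self.topics = defaultdict(list)

    def publish(self, topic, message):
        self.topics[topic].append(message)
        return len(self.topics[topic]) - 1        # offset of the new message

    def fetch(self, topic, offset, max_messages):
        return self.topics[topic][offset:offset + max_messages]


class ToyConsumer:
    """Reads a topic at its own pace by tracking its own offset."""

    def __init__(self, broker, topic):
        self.broker, self.topic, self.offset = broker, topic, 0

    def poll(self, max_messages=10):
        batch = self.broker.fetch(self.topic, self.offset, max_messages)
        self.offset += len(batch)
        return batch


broker = ToyBroker()
for i in range(5):                                 # a fast producer
    broker.publish("sensor", f"reading-{i}")

slow = ToyConsumer(broker, "sensor")               # a slow consumer
print(slow.poll(max_messages=2))
print(slow.poll(max_messages=10))
```

The producer never waits for the consumer, and the consumer catches up whenever it polls, which is the property the paragraph above describes.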

Accelerating Kafka Producers

Data ingest into big data systems ranges from simple to complex. In figure 2, data source one may be packet captures of network traffic. However, data source two could be complex geospatial images from a constellation of satellites, while data source three is industrial IoT maintenance data on a windmill farm in West Texas.

Figure 2 : Streaming Data Ingest Acceleration with Intel FPGAs


The variability in data formats and data rates makes the problem difficult to scale. Being able to adapt in real time to bursts in traffic and new formats is often costly, requiring provisioning of additional NICs and processors. Figure 3 shows a typical processor-based architecture used in most Kafka clusters.

Figure 3 : Typical Ingest Pathway


Data rate variability makes the system in figure 3 difficult to plan. In many cases, the maximum bandwidth must be estimated and then provisioned. An excess of 50% or more in processors and NICs will then sit idle, waiting for increases in data rates.

Moving to an Intel FPGA based solution, the same maximum bandwidth will be estimated, but the simplified system in figure 4 will draw much less power while idle and requires a considerably smaller footprint overall. The system in figure 2 will also eliminate the flow control and load-balancing management needed for a processor-based system, because the Intel FPGA based approach is deterministic regardless of data rate or data format.

Intel FPGAs are streaming, parallel accelerators that attach directly to copper, fiber & optical wires. Unlike traditional GPUs and CPUs, Intel FPGAs can move any data in any format from wire to memory in nanoseconds without the need of a Network Interface Card (NIC).

This acceleration of ingest can result in 40X lower latency in data ingest to the Kafka producer. It provides the option for simultaneous real-time processing of the inflowing data, for example by implementing machine learning, image recognition, pattern matching, filtering, compression or encryption. Ingested data can therefore be accelerated and enriched to speed time to data acquisition and data analysis.

Use Case One:
Inline Extract & Transformation 

The most basic use case for FPGA ingest into a Kafka producer is shown in figure 4. Even for this most basic use case, the FPGA provides low latency and determinism for even extremely variable rates. The ability to extract and transform the data with OpenCL allows this use case to handle 10s to 100s of data types.

Figure 4 Inline, Low Latency, Deterministic, Extraction & Transformation


Use Case Two:
Inline Encryption & Decryption

Encryption is extremely expensive in processor cycles, but well understood on Intel FPGAs. FPGAs provide a low-latency, deterministic result without a dependency on the data rate. For processors, variable data rates could flood the processor resources, causing a bottleneck and/or dropped packets.

Figure 5 Inline, Low Latency, Deterministic, Encryption or Decryption


Use Case Three:
Inline Compression & Decompression

FPGAs are extremely efficient at compression and decompression. In this use case the FPGA is used to compress/decompress data before it is passed to the Kafka system.

Figure 6 Inline, Low Latency, Deterministic Compression or Decompression


Use Case Four:
Information Theory with Encrypted/Decrypted &
Compressed/Decompressed Streams

Shannon’s law is being applied to more streaming use cases to determine whether a stream is encrypted. It calculates the entropy of a packet, looking for randomness versus structured bytes. Many, but not all, encrypted byte streams look similar to structured data. Figure 7 shows a possible flow that calculates the entropy, attempts to decrypt and then decompresses the stream before it is published to a Kafka topic. Even if the decryption and/or decompression cannot be done successfully, sorting encrypted versus unencrypted streams has many applications in industries that handle personally identifiable information, such as finance and health care.
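The entropy measurement can be sketched as the Shannon entropy of the byte distribution in a payload. This is a software illustration only; the function names and the 7.5 bits-per-byte threshold are assumptions for the example, and a high score cannot by itself distinguish encrypted data from compressed data.

```python
import math
from collections import Counter


def shannon_entropy(payload: bytes) -> float:
    """Entropy in bits per byte: 0.0 for constant data, 8.0 for uniform bytes."""
    if not payload:
        return 0.0
    total = len(payload)
    counts = Counter(payload)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def looks_random(payload: bytes, threshold: float = 7.5) -> bool:
    """Flag payloads whose byte distribution is close to uniform.

    ASSUMPTION: the threshold is an illustrative value, not a tuned one.
    """
    return shannon_entropy(payload) >= threshold


text = b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n\r\n"
uniform = bytes(range(256)) * 4
print(looks_random(text), looks_random(uniform))
```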

Use Case Five:
Enriched Topic Routing

Figure 8 Enriched Topic Routing of PCAPs for Cyber Analytics


Kafka’s flexible topic architecture allows ingested data to be placed into many topics. This flexibility means incoming data can be routed/switched using machine learning and pattern matching. Take figure 8 above, which shows raw network packets being captured (PCAPs). As the packets are captured, complex pattern matching using PCRE expressions can route them to the appropriate topics. This allows the Kafka consumers to subscribe to enriched topics and bypass a cleaning stage. For many cyber analytics applications, the processing realizes a 1000X improvement in cyber operations per watt based on research published by DOE Sandia & Lewis Rhodes Labs.
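Pattern-based topic routing can be sketched in software with regular expressions. Python's `re` module is used here as a stand-in for PCRE, and the patterns and topic names (`pcap.http`, `pcap.ssh`, `pcap.other`) are hypothetical examples, not part of any Nallatech or Kafka configuration.

```python
import re

# Ordered routing table: the first matching pattern decides the topic.
ROUTES = [
    (re.compile(rb"^(GET|POST|PUT|DELETE) \S+ HTTP/1\.[01]"), "pcap.http"),
    (re.compile(rb"^SSH-\d\.\d"), "pcap.ssh"),
]
DEFAULT_TOPIC = "pcap.other"


def route(payload: bytes) -> str:
    """Return the topic name for a captured packet payload."""
    for pattern, topic in ROUTES:
        if pattern.search(payload):
            return topic
    return DEFAULT_TOPIC


print(route(b"GET /index.html HTTP/1.1\r\n"))
print(route(b"SSH-2.0-OpenSSH_8.9"))
print(route(b"\x00\x01\x02"))
```

Consumers subscribing to one of these topics would then receive only traffic that has already been classified.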

Nallatech 385A Cloudera/Intel Example

The Nallatech 385A provides two network ports supporting up to 40GbE each. This NIC-sized card can replace an existing NIC/CPU combination to significantly accelerate existing Kafka networks and reduce power.

This has been verified by Cloudera and Intel to accelerate Kafka to Spark streaming, whilst performing data enrichment on the FPGA (Figure 9).

Figure 9 Enriched data using 385A


In the above demonstration, we have chosen engine noise signatures as our input data stream. They are ingested and offloaded via a UDP offload engine and placed into the card’s OpenCL environment. OpenCL code running on the card performs real-time formatting on the incoming data stream. It then performs an FFT and feature extraction, and classifies the signal as “normal” or “abnormal” based on comparison with known engine signatures. This extra bit of data, along with the FFT of the engine signals, is transferred by DMA into Kafka for further processing.
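The transform-then-classify flow can be imitated in software. The sketch below is not the OpenCL code from the demonstration: it uses a direct O(n²) discrete Fourier transform in place of an FFT, takes the dominant frequency bin as the only feature, and compares it with an expected bin. All names, the feature choice and the tolerance are assumptions for illustration.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Magnitude spectrum for bins 0 .. n/2 - 1 (direct DFT, not a fast FFT)."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(samples)))
        for k in range(n // 2)
    ]


def dominant_bin(samples):
    """Index of the strongest non-DC frequency bin."""
    mags = dft_magnitudes(samples)
    return max(range(1, len(mags)), key=mags.__getitem__)


def classify(samples, expected_bin, tolerance=1):
    """Label a signal by how far its dominant bin is from the known signature."""
    drift = abs(dominant_bin(samples) - expected_bin)
    return "normal" if drift <= tolerance else "abnormal"


def tone(bin_index, n=64):
    """Synthetic test signal with all of its energy in one frequency bin."""
    return [math.sin(2 * math.pi * bin_index * t / n) for t in range(n)]


print(classify(tone(5), expected_bin=5))
print(classify(tone(12), expected_bin=5))
```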

This example also highlights the flexibility of OpenCL-generated libraries, which can be applied to incoming streaming data. This offers the end user immense latitude to include very application-specific forms of data enrichment or data filtering.
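
The FFT, feature extraction and classification steps described above can be sketched in plain Python. This is an illustration only, not the OpenCL code that runs on the card. The single feature (the dominant frequency bin), the `KNOWN_ENGINE_BINS` signature values and the tolerance are assumptions made for the example.

```python
import cmath
import math


def fft(samples):
    """Recursive radix-2 Cooley-Tukey FFT; len(samples) must be a power of two."""
    n = len(samples)
    if n == 1:
        return [complex(samples[0])]
    even = fft(samples[0::2])
    odd = fft(samples[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def dominant_bin(samples):
    """Feature extraction: index of the strongest non-DC frequency bin."""
    spectrum = fft(samples)
    half = len(samples) // 2
    magnitudes = [abs(c) for c in spectrum[1:half]]
    return 1 + magnitudes.index(max(magnitudes))


def classify(samples, known_bins, tolerance=1):
    """Label a frame "normal" if its dominant bin is near a known signature."""
    peak = dominant_bin(samples)
    near = any(abs(peak - b) <= tolerance for b in known_bins)
    return "normal" if near else "abnormal"


def tone(bin_index, n=256):
    """Synthetic test signal: one sine wave with `bin_index` cycles per frame."""
    return [math.sin(2 * math.pi * bin_index * t / n) for t in range(n)]


# Hypothetical signature of a healthy engine, expressed as FFT bin indices.
KNOWN_ENGINE_BINS = [12, 24]
```

For instance, `classify(tone(12), KNOWN_ENGINE_BINS)` labels a frame whose energy sits at a known bin as "normal", while a tone at bin 40 is labelled "abnormal". A production classifier would likely compare richer spectral features than a single peak.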

520N: 100 GbE with Stratix 10

The Nallatech 520N’s four network ports support an array of serial I/O protocols operating at up to 10/25/40/100 Gb/s. With a total throughput of up to 400 Gb/s, the 520N is capable of enriching high volumes of data prior to offloading to a Kafka framework.

Figure 10 Enriched data using 520N

The 520N is populated with the powerful Stratix 10 FPGA offering unparalleled performance.
With the combination of high throughput, large amounts of compute and programmability using OpenCL, it is possible to perform complex data enrichment on streaming data on a single device.

More Information and How to Evaluate

Nallatech along with Intel PSG are experts at Kafka acceleration. Nallatech has current and planned products to accelerate Apache Kafka using Arria 10 and Stratix 10 FPGAs. Please contact Nallatech to discuss your needs and develop an accelerated solution.

View All Nallatech FPGA Cards

FPGA Accelerated Compute Node - with up to (4) 520s
520 - Compute Accelerator Card with Stratix 10 FPGA
510T - Compute Accelerator with (2) Arria 10 FPGAs
385A - Network Accelerator with Arria 10 GX1150 FPGA

Any Rate, Any Format: Accelerating Kafka Producers with FPGAs (2017-11-14T08:04:01+00:00)

Nallatech exhibiting at SuperComputing 17

Nallatech Showcases Next Generation FPGA Accelerators at Supercomputing 2017

Leaders in FPGA Acceleration

Visit booth 1362 for Machine Learning and Kafka Data Ingest case studies using the latest generation of FPGA accelerators and tools

LISLE, IL – November 13, 2017 – Nallatech, a Molex company, will showcase FPGA solutions for high-performance computing (HPC), low latency network acceleration and data analytics at the Supercomputing 2017 (SC17) Conference and Exhibition, November 13-16 in Denver, Colorado.

FPGA Acceleration Card with Stratix 10 FPGA
“FPGAs are being deployed in volume across a range of on-premise platforms and cloud infrastructure to achieve a step-change in application performance and energy-efficiency above and beyond what can be achieved using conventional processor technologies” said Craig Petrie, VP Business Development of FPGA Solutions at Nallatech. “We’re excited to be showcasing our new OpenCL-programmable ‘520’ product range featuring Intel Stratix-10 FPGAs. These server-qualified accelerator products have been engineered to cost-effectively solve demanding co-processing and real-time data ingest and enrichment applications.”

Nallatech will present two example applications featuring the latest hardware and tools, where FPGAs demonstrate significant value to customers:

Convolutional Neural Networks (CNN) – Object classification using a low profile Nallatech 385A™ PCIe accelerator card with a discrete Intel Arria 10 FPGA accelerator programmed using Intel’s OpenCL Software Development Kit. Built on the BVLC Caffe deep learning framework, an FPGA interface and IP accelerate processing intensive components of the algorithm. Nallatech IP is capable of processing an image through the AlexNet neural network in nine milliseconds. The Arria10-based 385A™ board has the capacity to process six CNN images in parallel allowing classification of 660 images per second.

Kafka Ingest/Egress – Acceleration of Kafka Producers using the advanced capabilities of Intel’s new Stratix-10 FPGA silicon and OpenCL Software Development Kit (SDK). This case study describes an analytic framework that provides up to a 40 times increase in ingest performance, enabling real-time data filtering and enrichment.

Additionally, Nallatech will display a range of leading-edge technologies at SC17 including:

520N™ Network Accelerator Card — A GPU/Phi-sized 16-lane PCIe Gen 3 card sporting four 100G network ports directly coupled to an Intel Stratix-10 FPGA. Four independent banks of DDR4 memory complete the balanced architecture capable of handling latency-critical 100G streaming applications.

520C™ Compute Acceleration Card – A GPU/Phi-sized 16-lane PCIe Gen 3 card, the OpenCL-programmable 520C™ features an Intel Stratix-10 FPGA designed to deliver ultimate performance per watt for compute-intensive HPC workloads.

About Nallatech:
Nallatech, a Molex company, is a leading supplier of accelerated computing solutions. Nallatech has deployed several of the world’s largest FPGA hybrid compute clusters, and is focused on delivering scalable solutions that deliver high performance per watt, per dollar. www.nallatech.com.

About Molex, LLC
Molex brings together innovation and technology to deliver electronic solutions to customers worldwide. With a presence in more than 40 countries, Molex offers a full suite of solutions and services for many markets, including data communications, consumer electronics, industrial, automotive, commercial vehicle and medical. For more information, please visit http://www.molex.com.

Nallatech exhibiting at SuperComputing 17 (2017-11-14T08:36:03+00:00)

OpenCapi Blog: Post 1

Datacentric Architectures

Molex/Nallatech Leverages OpenCAPI for 200GBytes/s of Hyperconverged NVMe Storage Bandwidth
By Allan Cantle

Over the last decade the computing industry has managed to deliver application performance improvements and better energy-efficiency for its customers by embracing parallelism, co-processor type acceleration and techniques to bypass and unburden the CPU. These have worked on the premise of maintaining the CPU centric nature of the server while effectively adding data centric enhancements.

To maintain this rate of incremental improvement, the industry is now embracing many more system level enhancements to the fundamental computing architecture, and the CPU is becoming an important member in a fundamentally data centric architecture, rather than being at the heart of that architecture.  With this architectural shift the network fabric is becoming the critical piece at the center, and we can see this evidenced by the plethora of new fabric standards including Omni-Path, NVLink, OpenCAPI, Gen-Z, CCIX and Infinity Fabric, to name a few. Each of these fabrics claims to solve either a piece of or all the communication requirements for future data centric architectures.

OpenCAPI is enjoying the early mover advantage as an excellent open standard conduit, both metaphorically and physically, in facilitating this data centric industry shift. This becomes even more important when you realize that the industry cannot leave behind CPU centric legacy software that will need to continue running for many decades to come.

It is critical to understand that OpenCAPI is singularly focused on being the best coherent, low latency and high bandwidth (25GBytes/s Tx & 25GBytes/s Rx) interconnect for the hyperconvergence of data centric architectural pieces within a node. Consequently, it is looking for a complementary fabric to support the ingress and egress of data to and from the node. This will be a topic for a later blog.  OpenCAPI based hyperconverged solutions must also become more programmable, in a similar vein to frameworks developed earlier on CAPI such as CAPI SNAP (Storage Networking & Acceleration Programming).

Nallatech is a pioneer of data centric computing using FPGAs, where computational functions are built around flowing data streams. It has 24 years of experience in successfully helping customers to migrate and deploy data centric heterogeneous architectures featuring FPGA technology. OpenCAPI was designed to leverage the strengths of FPGA architectures and minimize the impact of their weaknesses. Figure 1 shows a block diagram of Nallatech’s perspective of how the OpenCAPI bus is at the heart of enabling the true emergence of data centric architectures.


Figure 1 OpenCAPI enabling Data Centric architectures through a Hyperconverged & Disaggregatable Architecture

Critical to this industry transformation is the open collaboration of all the industry’s experts with their differing skillsets. This openness, especially at the interface level, will help to ensure that the best ideas win out and that everyone can innovate around these new standards to deliver the best solutions to the industry’s customer base, including the essential software infrastructure stacks that will make this technology easily accessible to application developers.

With Nallatech’s data centric heritage, Molex & Nallatech are drawing on decades of experience in tackling complex data centric problems.  These range from HPDA applications such as video analytics & AI to classical memory bound HPC problems like seismic migration algorithms.  The new system level solutions, based around OpenCAPI, will deliver over 5x performance gains at power levels that realistically begin to approach the DOE’s 20MW Exascale target.

Additionally, Nallatech will leverage OpenCAPI to ensure that valuable memory resources can be effectively shared with the CPU without breaking the essential support of the legacy CPU centric code base.

Come by the OpenCAPI, Molex & Nallatech booths #1587-#1589, #1263 & #1362 where we will be showcasing how our Sawmill FSA (Flash Storage Accelerator) development platform brings up to 200GBytes/s of hyperconverged accelerated storage to the Google/Rackspace Zaius/Barreleye-G2 POWER9 OCP Platform. The Sawmill FSA is designed to natively support the benefits of OpenCAPI by providing the lowest possible latency and highest bandwidth to NVMe Storage with the added benefits of OpenCAPI Flash functionality and near storage FPGA acceleration. HPDA applications such as graph analytics, in-memory databases and bioinformatics are expected to benefit greatly from this platform.

OpenCapi Blog: Post 1 (2017-11-13T06:32:06+00:00)