As our last blog entry illustrated, moving data from storage devices to server DRAM is one of the greatest bottlenecks to HPC application performance today because we are limited by the speed of the PCI Express (PCIe) bus. To put this in perspective, moving a petabyte of data over a 32-lane PCIe 3.0 bus takes over nine hours. Even at the speeds expected from PCIe 5.0 (which has not yet shipped), it still takes roughly two hours to move a petabyte of data. If that sounds bad, consider how long it would take to move the same data from a storage array to servers across a single 100 gigabit Ethernet (GbE) link (roughly 29 hours, if you were wondering). If you can afford to have your multi-million-dollar HPC cluster sit idle most of the time while data moves from storage to memory, stop reading now. If, however, you want to learn how to get more application runs per day, one technology to look at is computational storage.
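The arithmetic behind those figures is straightforward: transfer time is simply data size divided by effective bandwidth. A minimal sketch, using commonly cited per-lane PCIe throughput figures (assumptions, not measured numbers; real links lose additional time to protocol overhead, which is why the raw 100 GbE estimate below comes out lower than the ~29-hour figure above):

```python
def transfer_hours(total_bytes, bandwidth_gb_per_s):
    """Hours needed to move total_bytes at the given bandwidth (GB/s)."""
    return total_bytes / (bandwidth_gb_per_s * 1e9) / 3600

PETABYTE = 1e15  # decimal petabyte, in bytes

# Assumed effective bandwidths (GB/s) -- illustrative figures only:
pcie3_x32 = 32 * 0.985   # PCIe 3.0: ~985 MB/s usable per lane
pcie5_x32 = 32 * 3.94    # PCIe 5.0: roughly 4x PCIe 3.0 per lane
eth_100g = 12.5          # 100 GbE raw line rate (no protocol overhead)

for name, bw in [("PCIe 3.0 x32", pcie3_x32),
                 ("PCIe 5.0 x32", pcie5_x32),
                 ("100 GbE", eth_100g)]:
    print(f"{name}: {transfer_hours(PETABYTE, bw):.1f} hours per petabyte")
```

Running this gives roughly 8.8 hours for PCIe 3.0 x32 and 2.2 hours for PCIe 5.0 x32, consistent with the estimates above.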
Computational storage accelerates applications by running them inside the storage device (typically a flash solid-state drive). In this sense, it is a lot like MapReduce for big data, where the computation is moved to where the data is rather than vice versa (though on a much smaller scale). Computational storage devices contain one or more processors (typically an ARM processor, potentially alongside other accelerator hardware) that can operate directly on the data in the storage device. Typical operations run on a computational storage device include searches, lookups, encryption/decryption, inference, and artificial intelligence workloads. This enables the computational storage device either to provide results directly or to move a much smaller set of data to the server, considerably reducing data movement times.
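The core idea can be sketched as a "pushdown" query: instead of shipping every record across the bus for the host to filter, the host sends the predicate to the drive and only the matches come back. The `ComputationalDrive` class and its interface below are purely illustrative, not a real device API:

```python
# Hypothetical sketch: contrast host-side filtering (move all the data)
# with a computational-storage pushdown (move only the results).

class ComputationalDrive:
    def __init__(self, records):
        self.records = records  # data resident on the drive

    def read_all(self):
        # Conventional path: every record crosses the bus to the host.
        return list(self.records)

    def query(self, predicate):
        # Pushdown path: the on-drive processor applies the predicate
        # locally, so only matching records cross the bus.
        return [r for r in self.records if predicate(r)]


drive = ComputationalDrive(range(1_000_000))

# Conventional: host pulls 1,000,000 records, then filters them itself.
host_side = [r for r in drive.read_all() if r % 100_000 == 0]

# Computational storage: the drive returns just the 10 matches.
pushed_down = drive.query(lambda r: r % 100_000 == 0)

assert host_side == pushed_down
print(len(pushed_down), "records moved instead of", len(drive.records))
```

Both paths produce the same answer; the difference is that the pushdown moves 10 records over the bus instead of a million, which is exactly the data-movement saving described above.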
Granted, not every HPC application will be accelerated by computational storage, nor is it a replacement for other HPC accelerators such as GPGPUs for all HPC problems. That said, a variety of HPC applications today can take advantage of computational storage to accelerate their performance. We recently ran the Facebook AI Similarity Search (FAISS) application on our previous-generation computational storage device, the Catalina-2 SSD. Whereas processing time grew exponentially with dataset size on standard processing architectures, a server utilizing Catalina-2 SSDs maintained nearly constant processing time (see the video of our demo here). In our next blog, we will explore what types of applications are candidates for similar acceleration using computational storage.