Esthela Gallardo and Patricia J. Teller have penned an article for HPCwire that explores the challenges of cross-accelerator performance profiling. As Gallardo and Teller note, high performance computing (HPC) systems consist of multiple compute nodes interconnected by a network.
“Previously these nodes were composed solely of multi-core processors, but nowadays they also include many-core processors, which are called accelerators,” the authors explained.
“Although accelerators provide higher levels of parallelism, their inclusion in HPC systems results in more complex system designs and increases the difficulty of quantifying the runtime behavior of applications that employ them.”
According to Gallardo and Teller, this is because performance metrics are used to explain application runtime behavior, yet there is no consistency in the number or types of metrics exposed by different computing devices. In addition, differences in device architecture can make it impossible to compare exposed metrics directly.
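A minimal sketch can make the comparability problem concrete. The counter names and values below are purely illustrative (they are not taken from any real profiling interface): two devices expose differently named raw counters that cannot be compared directly, but a derived, device-neutral metric such as achieved FLOP/s can be.

```python
# Hypothetical raw counters from two different devices; names and values
# are illustrative only, not from any real profiling tool.
cpu_counters = {"fp_ops_retired": 4.0e9, "elapsed_sec": 2.0}
gpu_counters = {"sm_flop_count": 4.0e9, "kernel_time_sec": 0.25}

def achieved_flops(flop_count, seconds):
    """Floating-point operations per second: a derived metric that is
    comparable across devices even when raw counters are not."""
    return flop_count / seconds

cpu_rate = achieved_flops(cpu_counters["fp_ops_retired"],
                          cpu_counters["elapsed_sec"])
gpu_rate = achieved_flops(gpu_counters["sm_flop_count"],
                          gpu_counters["kernel_time_sec"])

print(f"CPU: {cpu_rate:.2e} FLOP/s")  # CPU: 2.00e+09 FLOP/s
print(f"GPU: {gpu_rate:.2e} FLOP/s")  # GPU: 1.60e+10 FLOP/s
```

Even this toy example glosses over the deeper issue the authors raise: on real hardware, how "one floating-point operation" is counted can itself differ between architectures.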
“[Nevertheless], in the case of processor design, the needs of applications will continue to drive the introduction of new device architectures. As a result, HPC systems will become more heterogeneous and application developers will have a wide array of fast computing devices on which to run their applications,” the authors continued. “However, without a means to compare the performance of devices with different architectures, it will remain difficult to determine whether a program should be launched solely on multi-core processors or should employ accelerators, which accelerators should be employed and how new accelerators should be designed.”
Commenting on the above, Steven Woo, VP of Systems and Solutions at Rambus, told us that accurately comparing the performance of acceleration devices with disparate architectures will remain a significant industry challenge for the foreseeable future. Nonetheless, says Woo, understanding the advantages of a specific architecture is key to addressing this critical issue.
“For example, GPUs are perhaps best suited for applications such as visualization, graphics processing, various types of scientific computations and machine learning,” he told Rambus Press. “The combination of numerous parallel pipelines with high bandwidth memory makes GPUs the compute engine of choice for these types of applications.”
For other types of workloads, says Woo, FPGAs may be the most appropriate choice.
“When paired with traditional CPUs, FPGAs are capable of providing application-specific hardware acceleration that can be updated over time,” he added. “In addition, applications can be partitioned into segments that run most efficiently on the CPU and other parts which run most efficiently on the FPGA.”
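The partitioning Woo describes can be sketched as a simple dispatch pattern. Everything here is a hypothetical illustration: the stage functions, the device names, and the placement table are invented for the example, and a real system would hand "fpga" stages to a vendor runtime rather than run them on the host.

```python
# Sketch of CPU/FPGA partitioning: each pipeline stage is tagged with the
# device it runs most efficiently on. All names are illustrative.

def preprocess(data):
    # Control-heavy, branchy work: typically stays on the CPU.
    return [x for x in data if x >= 0]

def transform(data):
    # Regular, data-parallel work: a candidate for FPGA offload.
    return [x * x for x in data]

# Stage -> preferred device, decided offline (e.g. from profiling data).
PLACEMENT = {preprocess: "cpu", transform: "fpga"}

def run_pipeline(data, stages):
    for stage in stages:
        device = PLACEMENT.get(stage, "cpu")
        # In a real deployment, "fpga" stages would be launched through an
        # accelerator runtime; in this sketch both paths run on the host.
        data = stage(data)
    return data

result = run_pipeline([3, -1, 2], [preprocess, transform])
print(result)  # [9, 4]
```

The design point is that placement decisions live in one table rather than being hard-coded into the stages, so the partition can be revised as profiling data, or the FPGA bitstream itself, changes.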
As Woo points out, flexibility is only one advantage of FPGAs: field programmable gate arrays can also be attached to the same types of memory as CPUs.
“It’s actually a very flexible kind of chip. For a specific application or acceleration need, FPGAs are capable of providing optimized performance and improved energy efficiency. Moreover, the ease of developing something quickly to test out new concepts makes FPGAs an ideal platform for innovation,” he explained. “In fact, this is why some design teams start with an FPGA, then turn it into an ASIC to get a hardened version of the logic they put into an FPGA. They start with an FPGA to see if that market grows. That could justify the cost of developing an ASIC.”
In addition to offering versatility, says Woo, reprogrammable and reconfigurable FPGAs can be loaded with a wide range of algorithms without going through the difficult and costly design process typically associated with ASICs. Meanwhile, the flexible nature of FPGAs allows the silicon to be easily reconfigured to meet changing application demands.
From a broader perspective, says Woo, the industry is going through a big learning cycle as new hardware and computing paradigms emerge.
“In machine learning applications, for example, we’re seeing GPUs being used for neural network training and FPGAs being used for inference. GPUs and FPGAs offer different advantages for various phases of the machine learning process, so both are being used at the appropriate times,” he concluded.
Interested in learning more about smart data acceleration? You can check out our article archive on the subject here.