In part two of this three-part series, Semiconductor Engineering Editor in Chief Ed Sperling and Suresh Andani, Senior Director, Product Marketing and Business Development at Rambus, discussed early market adoption of PCIe 5, as well as the networking environment the specification will support in the data center. In this blog post, Sperling and Andani explore the compute and storage demands that necessitate PCIe 5.
“Once the data [is fed] into the CPU, it has to be processed. Workloads such as HPC and AI/ML require a lot of computing. As you can see [in the video below], the ToR switch to the NIC is going to 400G Ethernet – and the link between the NIC and CPU is PCIe 5,” Andani explains.
“Now the CPU has a linear architecture. With multi-threading, hyper-threading and adding more cores, you can do a bit more parallel processing. However, applications such as convolutional neural networks (CNNs) and deep neural networks (DNNs) used in AI/ML workloads such as video transcoding and image processing require massively parallel compute architectures. This is where we are seeing more and more accelerators go into hyperscale servers.”
As Andani observes, there are multiple neural networks for various applications.
“For image recognition, you’re going to use a CNN or DNN. If you look at voice recognition or natural language processing, these are different types of AI workloads, so the algorithms needed to run these workloads are different,” he states. “That is why you see different types of neural networks, but they all work in parallel, depending on what kind of workload you’re running.”
Andani continues, pointing out that although the data is fed to the CPU (via PCIe 5), the CPU is assisted by accelerators which themselves will interface with the CPU via PCIe 5.
“These could be GPUs that you see a lot in AI training type of applications. You are also starting to see a lot of FPGAs, specifically for high-performance compute workloads like video transcoding, although FPGAs are also quite good from an AI inference perspective. In addition, some companies are building dedicated or custom ASICs,” he explains.
“With all accelerators, the way to connect to the CPU is via PCIe. With older workloads, which were not as sophisticated, it was sufficient to use PCIe 4 bandwidth. However, new sophisticated workloads like AI/ML require a lot of bandwidth over PCIe 5.”
“There is a distinction between certain workloads that are very latency sensitive, for example, real-time AI. If you are running a real-time AI workload in a cloud, you cannot tolerate a lot of latency between various compute nodes. That is where some new cache-coherent protocols like Compute Express Link (CXL) and CCIX are gaining more steam.”
On the storage side, says Andani, data centers that did not have a hyperconverged architecture typically maintained SAS/SATA-based storage servers running over Fibre Channel. However, in modern data centers these servers are becoming hyperconverged to reduce total cost of ownership (TCO).
“It makes a lot of sense to have a storage drive like an SSD flash drive (which is quite fast) that can be plugged into the same compute server, so you don’t need extra storage servers. That is why you are seeing more and more NVMe-based SSD drives on the market,” he elaborates. “With more and more ultra-high definition videos being stored and accessed on demand, you cannot tolerate a lot of latency in this transfer, like video being transferred from the SSD drive to the CPU and back to the network. So, you need a very fast and efficient interface between these SSD drives and the CPUs. Hence, these are basically now PCIe 5-based interfaces between NVMe SSDs and the CPU.”
According to Andani, the link between NVMe SSDs and the CPU is typically x4, whereas the SmartNIC and the accelerators are typically x8 or x16. Because of the small form factor and limited space, SSDs are typically put in an x4 slot.
“The key point: PCIe is driving the speeds and feeds needed for very fast video or storage access to the NVMe drives between the CPU and the SSD controllers. From a storage perspective, the videos are [scaling to] higher and higher resolution, which means the interface between the controllers and the CPU must become faster and faster.”
As Andani points out, there are two ways of increasing the bandwidth between the SSD and the CPU.
“One, you can take the PCIe 3 or PCIe 4 that you have and go from an x4 form factor to an x8 or x16. However, you are very space-limited in a very crowded server and cannot afford to go to x8 or x16 for storage,” he says. “This is where the solution is to keep the x4 form factor, but to get double the bandwidth, you double the speed [by moving from] PCIe 4 to PCIe 5.”
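The two options Andani describes can be compared with a rough calculation. The sketch below is illustrative only: it assumes the published per-lane signaling rates (8, 16 and 32 GT/s for PCIe 3, 4 and 5) and 128b/130b line encoding, and it ignores packet and flow-control overhead, so real throughput will be lower. The function name `link_bandwidth_gbps` is a hypothetical helper, not something from the discussion.

```python
# Per-lane signaling rates in GT/s for each PCIe generation.
PCIE_GT_PER_S = {3: 8.0, 4: 16.0, 5: 32.0}

# PCIe 3, 4 and 5 all use 128b/130b line encoding.
ENCODING_EFFICIENCY = 128 / 130


def link_bandwidth_gbps(gen: int, lanes: int) -> float:
    """Approximate one-direction link bandwidth in Gb/s.

    Assumption: only line-encoding overhead is counted; packet headers
    and flow control are ignored, so this is an upper bound.
    """
    return PCIE_GT_PER_S[gen] * lanes * ENCODING_EFFICIENCY


if __name__ == "__main__":
    # Baseline x4 link, the wider-link option, and the faster-link option.
    for gen, lanes in [(4, 4), (4, 8), (5, 4)]:
        gbps = link_bandwidth_gbps(gen, lanes)
        print(f"PCIe {gen} x{lanes}: {gbps:.1f} Gb/s ({gbps / 8:.1f} GB/s)")
```

Under these assumptions, a PCIe 5 x4 link delivers the same estimated bandwidth as a PCIe 4 x8 link (about 15.8 GB/s per direction, versus about 7.9 GB/s for PCIe 4 x4), which is why doubling the speed lets the drive keep its x4 form factor.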
In terms of thermal impact, Andani observes that the increase in data bandwidth, computation and number of IOPS (Input/Output Operations Per Second) corresponds to an increase in heat.
“Your TDP per chip, whether it’s an accelerator or a storage drive or even a SmartNIC, is going higher and higher. Luckily, thermal cooling technologies are also evolving. We are seeing liquid cooling in more and more data centers, which is passing liquid over the hot chips. We are also seeing racks immersed in liquid, which is quite sophisticated,” he explains. “The thermal cooling technology needs to keep up with [these trends]. PCIe 5 only solves the data bandwidth interface challenge, but there are many other problems that we need to solve collectively in a data center, thermal challenges being one of them.”
Andani concludes his conversation with Sperling by walking through a detailed summary chart.
“Here we see a server and an x16 slot where you have the NIC or the SmartNIC – this is the one that gets connected to the ToR. You have a couple of x4 slots on a server where you are connecting your storage and then there is another x16 slot for an accelerator like a GPU, ASIC or FPGA,” he states.
“This illustrates the different PCIe interfaces on a server like we [previously discussed]. So, between the CPU and the SmartNIC, you are running PCIe. Currently, there are 40G and 50G servers running PCIe 3 (x8). 100G servers (that are ramping now) are using PCIe 4 (x8). In the architecture phase now are the 200G and 400G servers that, as we saw in the bandwidth calculation a few moments ago, will require PCIe 5 (x16), especially for 400G Ethernet.”
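The pairings Andani lists can be sanity-checked with a simple estimate. This is a minimal sketch, not the calculation from the conversation itself: it assumes the published per-lane rates and 128b/130b encoding, compares one direction of the PCIe link against the Ethernet line rate, and ignores protocol overhead. The names `pcie_gbps`, `can_feed` and `PAIRINGS` are hypothetical.

```python
# Per-lane signaling rates in GT/s; PCIe 3 to 5 use 128b/130b encoding.
PCIE_GT_PER_S = {3: 8.0, 4: 16.0, 5: 32.0}
ENCODING_EFFICIENCY = 128 / 130


def pcie_gbps(gen: int, lanes: int) -> float:
    """Estimated one-direction PCIe bandwidth in Gb/s (upper bound)."""
    return PCIE_GT_PER_S[gen] * lanes * ENCODING_EFFICIENCY


def can_feed(ethernet_gbps: float, gen: int, lanes: int) -> bool:
    """True if the estimated PCIe bandwidth covers the Ethernet line rate."""
    return pcie_gbps(gen, lanes) >= ethernet_gbps


# (Ethernet speed in Gb/s, PCIe generation, lane count) as listed in the text.
PAIRINGS = [(40, 3, 8), (50, 3, 8), (100, 4, 8), (200, 5, 16), (400, 5, 16)]

if __name__ == "__main__":
    for eth, gen, lanes in PAIRINGS:
        verdict = "sufficient" if can_feed(eth, gen, lanes) else "insufficient"
        print(f"{eth}G Ethernet on PCIe {gen} x{lanes}: "
              f"{pcie_gbps(gen, lanes):.0f} Gb/s available, {verdict}")
    # The previous generation at full width, for comparison with 400G.
    print("400G on PCIe 4 x16:", can_feed(400, 4, 16))
```

By this estimate, PCIe 4 x16 tops out at roughly 252 Gb/s, which would cover a 200G port but not a 400G one. That is consistent with the remark that PCIe 5 (x16) matters especially for 400G Ethernet.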
Andani continues, pointing out the PCIe link between the CPU and the accelerator.
“Depending on the workloads that were running in the past, PCIe 3 or PCIe 4 was enough. However, with modern workloads like AI/ML and HPC, you are going to need the PCIe 5 (x16) bandwidth – and that is where we are going to see PCIe 5 deployed. Finally, storage is moving towards a U.2 form factor (SSD) which is moving from PCIe 3 to PCIe 4 and PCIe 5 (x4),” he concludes.