Rambus’ Suresh Andani has written a detailed Semiconductor Engineering article that explores how PCIe 5.0 can effectively accelerate AI and ML applications. According to Andani, the rapid adoption of sophisticated artificial intelligence/machine learning (AI/ML) applications and the shift to cloud-based workloads have significantly increased network traffic in recent years. However, the virtualization paradigm can no longer keep up, as AI/ML applications and cloud-based workloads quickly outpace server compute capacity.
“AI workloads – including machine learning and deep learning – require a new generation of computing architectures,” he explains. “This is because AI applications generate, move and process massive amounts of data at real-time speeds. For example, a smart car generates around 4TB of data per day, while AI and ML training model sizes continue to double approximately every 3-4 months.”
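Taking the quoted trend at face value, a doubling every 3-4 months compounds quickly. The sketch below is illustrative arithmetic only, assuming one doubling every 3.5 months on average:

```python
# Illustrative compound-growth arithmetic for the quoted trend.
# Assumption: "doubling every 3-4 months" is modeled as one doubling
# every 3.5 months on average.

def growth_factor(months: float, doubling_period_months: float) -> float:
    """Compound growth factor over `months` given a fixed doubling period."""
    return 2.0 ** (months / doubling_period_months)

one_year = growth_factor(12, 3.5)   # ~10.8x model-size growth per year
two_years = growth_factor(24, 3.5)  # ~116x over two years

print(f"1-year growth: {one_year:.1f}x")
print(f"2-year growth: {two_years:.1f}x")
```

On that assumption, model sizes grow by roughly an order of magnitude per year, far faster than the roughly two-year doubling cadence of interconnect bandwidth discussed below.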
As Andani notes, AI applications across multiple verticals are demanding significant amounts of memory bandwidth to support the processing of extremely large data sets. Moreover, unlike traditional multi-level caching architectures, AI applications require direct and fast access to memory.
“Additional characteristics and requirements of AI-specific applications include parallel computing, low-precision computing and empirical analysis assumption,” he explains. “AI/ML workloads are [also] extremely compute intensive – and they are shifting system architecture from traditional CPU-based computing towards more heterogeneous/distributed computing.”
Looking beyond AI/ML applications, says Andani, the conventional data center paradigm is evolving due to the ongoing shift to cloud computing.
“Enterprise workloads are moving to the cloud: 45% were cloud-based in 2017, while over 60% were cloud-based in 2019. As such, data centers are leveraging hyperscale computing and networking to meet the needs of cloud-based workloads,” he elaborates. “Because the economies of scale are driven by increasing the bandwidth per physical unit of space, this new cloud-based model (along with AI/ML applications) is accelerating the adoption of higher speed networking protocols that double in speed approximately every two years: 100GbE -> 200GbE -> 400GbE -> 800GbE.”
The steady march towards 400GbE cloud networking and the evolution of sophisticated AI/ML workloads are driving the need to double PCIe bandwidth every two years to move data effectively between compute nodes.
“PCIe 5.0 – with an aggregate link bandwidth of 128 GB/s in a x16 configuration – addresses these demands without ‘boiling the ocean’ as it is built on the proven PCIe framework,” he explains. “Essentially, the PCIe interface is the backbone that moves high-bandwidth data between various compute nodes (CPUs, GPUs, FPGAs, custom-built ASIC accelerators) in a heterogeneous compute setup.”
For system designers, says Andani, significant signal integrity experience is required to support the latest networking protocols like 400GbE.
“The performance of SoCs is contingent upon how fast data can be moved in, out and between other components. Because the physical size of SoCs remains approximately constant, bandwidth increases are primarily achieved by increasing the speed (data rate) of data per pin. Issues related to higher speeds – such as loss, crosstalk and reflections – all become more pronounced as data rates increase,” he adds.
As Andani emphasizes, significant increases in speed are required to support AI/ML applications such as massive training models and real-time inference.
“This means that all supporting technologies – such as CPU, memory access bandwidth and interface speeds – need to double every 1-2 years. PCIe 5.0, the latest PCIe standard, represents a doubling over PCIe 4.0: 32GT/s vs. 16GT/s, with a x16 link bandwidth of 128 GB/s.”
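A quick sketch of where the headline 128 GB/s figure comes from: 32 GT/s per lane across 16 lanes, counted in both directions of the full-duplex link. The effective payload figure below additionally assumes the standard 128b/130b line encoding that PCIe has used since the 3.0 generation:

```python
# How the quoted PCIe x16 bandwidth figures work out.
# "Aggregate" here counts both directions of the full-duplex link.

def pcie_x16_aggregate(gt_per_s: float, lanes: int = 16) -> float:
    """Raw aggregate (bidirectional) bandwidth in GB/s for a PCIe link."""
    per_direction_gbs = gt_per_s * lanes / 8  # 8 bits per byte
    return per_direction_gbs * 2              # both directions

raw_gen5 = pcie_x16_aggregate(32)  # PCIe 5.0 x16 -> 128.0 GB/s
raw_gen4 = pcie_x16_aggregate(16)  # PCIe 4.0 x16 ->  64.0 GB/s

# 128b/130b encoding trims ~1.5% off the raw rate.
effective_gen5 = raw_gen5 * 128 / 130  # ~126 GB/s

print(f"PCIe 5.0 x16 raw: {raw_gen5:.0f} GB/s, effective: {effective_gen5:.1f} GB/s")
print(f"PCIe 4.0 x16 raw: {raw_gen4:.0f} GB/s")
```

The calculation also makes the generational doubling explicit: every term is identical between PCIe 4.0 and 5.0 except the per-lane transfer rate.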
To effectively meet the demands of AI/ML applications and cloud-based workloads, says Andani, a PCIe 5.0 interface should be a comprehensive solution built on an advanced process node such as 7nm (FinFET). The solution should also comprise a co-verified PHY and digital controller, and it should support Compute Express Link (CXL) connectivity between the host processor and workload accelerators for heterogeneous computing.
“The introduction of CXL (which uses the same transport layer as PCIe 5.0) provides high-performance computing (HPC) and AI/ML system designers with a low-latency cache-coherent interconnect to virtually unify the system memory across various compute nodes,” he elaborates.
Additional key features and capabilities should include:
- 32 GT/s bandwidth per lane with 128 GB/s bandwidth in x16 configuration
- Backward compatibility to PCIe 4.0, 3.0 and 2.0
- Advanced multi-tap transmitter and receiver equalization to compensate for more than 36 dB of insertion loss
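To put the 36 dB figure in perspective, a short sketch of the standard decibel-to-ratio conversion shows how little of the transmitted signal survives the channel, which is why multi-tap equalization is needed at these data rates:

```python
# Convert insertion loss in dB to the fraction of signal amplitude
# remaining at the receiver (standard 20*log10 amplitude convention).

def db_to_amplitude_ratio(db_loss: float) -> float:
    """Fraction of signal amplitude remaining after db_loss of insertion loss."""
    return 10 ** (-db_loss / 20)

remaining = db_to_amplitude_ratio(36)
print(f"36 dB loss leaves ~{remaining:.4f} of the amplitude (~1/{1 / remaining:.0f})")
```

In other words, 36 dB of channel loss leaves roughly 1/63 of the original signal amplitude, and the equalizers must recover a clean eye from that attenuated waveform.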
“PCIe 5.0, the latest PCIe standard, represents a doubling over PCIe 4.0: 32GT/s vs. 16GT/s, with an aggregate x16 link bandwidth of 128 GB/s. At these speeds, it is important for system designers to have significant signal integrity experience to prevent loss, crosstalk and reflections,” Andani concludes.