Frank Ferro, Senior Director of Product Management at Rambus, has written a detailed article for Semiconductor Engineering that explains why HBM2E is a perfect fit for Artificial Intelligence/Machine Learning (AI/ML) training. As Ferro points out, AI/ML growth and development are proceeding at a lightning pace. Indeed, AI training capabilities have jumped by a factor of 300,000 (roughly 10X annually) over the past 8 years. This trend continues to drive rapid improvements in nearly every area of computing, including memory bandwidth.
HBM: A Need for Speed
Introduced in 2013, High Bandwidth Memory (HBM) is a high-performance 3D-stacked SDRAM architecture.
“Like its predecessor, the second generation HBM2 specifies up to 8 memory die per stack, while doubling pin transfer rates to 2 Gbps,” Ferro explains. “HBM2 achieves 256 GB/s of memory bandwidth per package (DRAM stack), with the HBM2 specification supporting up to 8 GB of capacity per package.”
As Ferro notes, JEDEC announced the HBM2E specification in late 2018 to support increased bandwidth and capacity.
“With transfer rates rising to 3.2 Gbps per pin, HBM2E can achieve 410 GB/s of memory bandwidth per stack,” he explains. “In addition, HBM2E supports 12‑high stacks with memory capacities of up to 24 GB per stack.”
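As a quick sanity check, those per-stack figures follow directly from the per-pin rate and the 1,024-bit stack interface described below; here is a minimal sketch of the arithmetic in Python (the 16 Gb per-die density used in the capacity check is an assumption, not something the article states):

```python
# Sanity check of the quoted per-stack bandwidth figures, assuming the
# 1,024-bit ("1,024 data wire") stack interface described in the next section.
HBM_INTERFACE_BITS = 1024  # data width of one HBM stack

def stack_bandwidth_gb_s(pin_rate_gbps: float) -> float:
    """Per-stack bandwidth in GB/s: per-pin rate (Gb/s) x width (bits) / 8 bits per byte."""
    return pin_rate_gbps * HBM_INTERFACE_BITS / 8

print(stack_bandwidth_gb_s(2.0))   # HBM2 at 2.0 Gbps/pin  -> 256.0 GB/s
print(stack_bandwidth_gb_s(3.2))   # HBM2E at 3.2 Gbps/pin -> 409.6 GB/s (~410 GB/s)

# Capacity check: a 12-high stack of 16 Gb (2 GB) DRAM die -- an assumed die
# density, not stated in the article -- yields 12 x 2 GB = 24 GB per stack.
print(12 * 2)                      # 24 (GB)
```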
As Ferro points out, all versions of HBM run at a relatively low per-pin data rate compared to a high-speed memory such as GDDR6.
“High bandwidth is achieved [using] an extremely wide interface. Specifically, each HBM2E stack running at 3.2 Gbps connects to its associated processor through an interface of 1,024 data ‘wires,’” he adds.
With command and address signals, says Ferro, the number of wires rises to about 1,700, far more than a standard PCB can support. A silicon interposer is therefore used as an intermediary to connect the memory stack(s) and the processor. As with an SoC, finely spaced data traces can be etched into the silicon interposer to achieve the number of wires needed for the HBM interface.
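To put the routing problem in perspective, here is a rough tally of the signal counts involved (the split between data wires and command/address/clock wires is an approximation inferred from the totals above, not an official pin count):

```python
# Illustrative wire-count tally for the HBM interface. The ~700 command/
# address/clock figure is an approximation inferred from the "about 1,700"
# total quoted above.
data_wires_per_stack = 1024
cmd_addr_clk_per_stack = 700            # approximate
wires_per_stack = data_wires_per_stack + cmd_addr_clk_per_stack

print(wires_per_stack)                  # ~1,700 wires for a single stack
print(4 * wires_per_stack)              # ~6,800 wires for a four-stack system
```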
HBM2E, Ferro emphasizes, offers tremendous memory bandwidth. More specifically, four HBM2E stacks connected to a processor can collectively deliver over 1.6 TB/s of bandwidth.
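That headline number is simple arithmetic on the per-stack figures quoted earlier; a short sketch:

```python
# Aggregate bandwidth and capacity for a four-stack HBM2E configuration,
# derived from the per-stack figures quoted earlier.
stacks = 4
per_stack_bandwidth_gb_s = 409.6   # GB/s at 3.2 Gbps per pin
per_stack_capacity_gb = 24         # GB for a 12-high stack

print(stacks * per_stack_bandwidth_gb_s / 1000)  # ~1.64 TB/s of bandwidth
print(stacks * per_stack_capacity_gb)            # 96 GB of capacity
```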
“With 3D stacking of memory, high bandwidth and high capacity can be achieved in an exceptionally small footprint. Further, by keeping data rates relatively low, and the memory close to the processor, overall system power is kept low,” he adds.
HBM Design Tradeoffs
Unsurprisingly, the design tradeoffs around HBM are increased complexity and costs. Specifically, says Ferro, the interposer is an additional element that must be designed, characterized, and manufactured.
“3D stacked memory shipments pale in comparison to the enormous volume and manufacturing experience built up making traditional DDR-type memories (including GDDR),” he explains. “The net is that implementation and manufacturing costs are higher for HBM2E than for memory using traditional manufacturing methods as in GDDR6 or DDR4.”
However, Ferro emphasizes, the benefits of HBM2E make it the superior choice for AI training applications.
“The performance is outstanding, and higher implementation and manufacturing costs can be traded off against savings of board space and power,” he elaborates. “In data center environments, where physical space is increasingly constrained, HBM2E’s compact architecture offers tangible benefits. Its lower power translates to lower heat loads for an environment where cooling is often one of the top operating costs.”
For training, says Ferro, bandwidth and capacity are “critical” requirements. This is particularly so given that training capabilities are on pace to double every 3.43 months.
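As a quick cross-check, that doubling period translates into roughly an order-of-magnitude increase per year:

```python
# Converting the 3.43-month doubling period into an annual growth factor:
# 2 ** (12 / 3.43) is roughly 11.3x per year, consistent with the
# "10X annually" figure cited at the top of the article.
doubling_period_months = 3.43
annual_growth = 2 ** (12 / doubling_period_months)
print(round(annual_growth, 1))   # ~11.3
```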
“Training workloads now run over multiple servers to provide the needed processing power – flipping virtualization on its head,” he explains. “Given the value created through training, there is a powerful ‘time-to-market’ incentive to complete training runs as quickly as possible. Furthermore, training applications run in data centers increasingly constrained for power and space, so there is a premium for solutions that offer power efficiency and smaller size.”
Given all these requirements, HBM2E is an ideal memory solution for AI training hardware. It delivers excellent bandwidth and capacity: 410 GB/s of memory bandwidth and 24 GB of capacity from a single 12‑high HBM2E stack. Its 3D structure provides these features in a very compact form factor and at lower power, thanks to a low interface speed and the proximity of memory to processor.
According to Ferro, designers can realize the benefits of HBM2E memory while mitigating its implementation challenges through their choice of IP supplier.
“Rambus offers a complete HBM2E memory interface sub-system consisting of a co-verified PHY and controller. An integrated interface solution greatly reduces implementation complexity,” he states. “Further, Rambus’ extensive mixed-signal circuit design history, deep signal integrity/power integrity and process technology expertise, and system engineering capabilities help ensure first-time-right design execution.”
As Ferro concludes, staying on the current pace of AI/ML training growth requires sustained, across-the-board improvements in both hardware and software. As part of this mix, memory is a critical enabler.
“HBM2E memory is an ideal solution, offering bandwidth and capacity at low power in a compact footprint [that] hits all of AI/ML training’s key performance requirements. With a partner like Rambus, designers can harness the capabilities of HBM2E memory to supercharge their next generation of AI accelerators,” he adds.