Frank Ferro, Senior Director of Product Management at Rambus, recently penned an article for Semiconductor Engineering that takes a closer look at high bandwidth memory (HBM) and 2.5D (interposer-based) architecture for AI/ML training. As Ferro notes, the influence of AI/ML grows daily, touching nearly every industry across the globe.
“In marketing, healthcare, retail, transportation, manufacturing and more, AI/ML is a catalyst for great change,” he explains. “This rapid advance is powerfully illustrated by the growth in AI/ML training capabilities which have since 2012 grown by a factor of 10X every year.”
According to Ferro, AI/ML neural network training models can currently exceed 10 billion parameters, a number that will soon jump to over 100 billion. This growth has been made possible by enormous gains in computing power thanks to Moore’s Law and Dennard scaling.
“At some point, however, the trend line of processing power, doubling every two years, would be overtaken by one that doubles every three-and-a-half months,” he elaborates. “That point is now. To make matters worse, Moore’s Law is slowing, and Dennard scaling has stopped, at a time when arguably we need them most.”
With no slackening in demand, says Ferro, it will take improvements in every aspect of computer hardware and software to stay on pace.
“Among these, memory capacity and bandwidth will be critical areas of focus to enable the continued growth of AI. If we can’t continue to scale down (via Moore’s Law), then we’ll have to scale up,” he states. “[This is precisely why] the industry has responded with 3D-packaging of DRAM in JEDEC’s High Bandwidth Memory (HBM) standard. By scaling in the Z-dimension, we can realize a significant increase in capacity.”
As Ferro points out, the latest iteration of HBM – HBM2E – supports 12-high stacks of DRAM with memory capacities of up to 24 GB per stack. However, that greater capacity would be useless to AI/ML training without rapid access. As such, the HBM2E interface provides bandwidth of up to 410 GB/s per stack. In real-world terms, this means an implementation with four stacks of HBM2E memory can deliver nearly 100 GB of capacity at an aggregate bandwidth of 1.6 TB/s.
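As a quick sanity check on those figures, here is a minimal sketch of the arithmetic, assuming the per-stack numbers quoted in the article (24 GB of capacity and roughly 410 GB/s of bandwidth per HBM2E stack):

```python
# Rough arithmetic behind the four-stack example, using the per-stack
# figures from the article (24 GB capacity and ~410 GB/s bandwidth per stack).
stacks = 4
capacity_per_stack_gb = 24      # GB per 12-high HBM2E stack
bandwidth_per_stack_gbs = 410   # GB/s per HBM2E stack (409.6 GB/s rounded)

total_capacity_gb = stacks * capacity_per_stack_gb                 # 96 GB ("nearly 100 GB")
aggregate_bandwidth_tbs = stacks * bandwidth_per_stack_gbs / 1000  # ~1.64 TB/s

print(f"{total_capacity_gb} GB total capacity")
print(f"{aggregate_bandwidth_tbs:.2f} TB/s aggregate bandwidth")
```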
Ferro also emphasizes that with AI/ML accelerators deployed in hyperscale data centers, it is critical to take heat dissipation issues and power constraints into consideration.
“HBM2E provides very power efficient bandwidth by running a ‘wide and slow’ interface. Slow, at least in relative terms, HBM2E operates at up to 3.2 Gbps per pin. Across a wide interface of 1,024 data pins, the 3.2 Gbps data rate yields a bandwidth of 410 GB/s,” he explains. “To data, add clock, power management and command/address, and the number of ‘wires’ in the HBM interface grows to about 1,700.”
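The per-stack bandwidth figure follows directly from the pin count and per-pin data rate Ferro cites; a back-of-the-envelope sketch, assuming a 1,024-bit data interface running at 3.2 Gbps per pin:

```python
# Back-of-the-envelope check of the 410 GB/s per-stack figure quoted above,
# assuming a 1,024-bit wide HBM2E data interface at 3.2 Gbps per pin.
data_pins = 1024        # data width of the HBM2E interface, in bits
pin_rate_gbps = 3.2     # per-pin data rate, in gigabits per second

stack_bandwidth_gbps = data_pins * pin_rate_gbps  # 3,276.8 Gb/s
stack_bandwidth_gbs = stack_bandwidth_gbps / 8    # 409.6 GB/s, i.e. ~410 GB/s

print(f"Per-stack bandwidth: {stack_bandwidth_gbs:.1f} GB/s")
```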
Since this is far more than can be supported on a standard PCB, a silicon interposer is used as an intermediary to connect the memory stack(s) and processor. The use of a silicon interposer is what makes this a 2.5D architecture. As with an IC, finely spaced traces can be etched into the silicon interposer to achieve the number of connections needed for the HBM interface.
“With 3D stacking of memory, high bandwidth and high capacity can be achieved in an exceptionally small footprint. In data center environments, where physical space is increasingly constrained, HBM2E’s compact architecture offers tangible benefits,” he continues. “Further, by keeping data rates relatively low, and the memory close to the processor, overall system power is kept low.”
According to Ferro, HBM2E memory delivers what AI/ML training needs, with high bandwidth, high capacity, compactness, and power efficiency. But there is a catch: the design trade-off with HBM is increased complexity and cost. More specifically, the silicon interposer is an additional element that must be designed, characterized, and manufactured.
“3D stacked memory shipments pale in comparison to the enormous volume and manufacturing experience built up making traditional DDR-type memories,” he states. “[This is because] implementation and manufacturing costs are higher for HBM2E than for a high-performance memory built using traditional manufacturing methods such as GDDR6 DRAM.”
Nevertheless, overcoming complexity through innovation is what the semiconductor industry has done time and again to push computing performance to new heights. With AI/ML, the economic benefits of accelerating training runs are enormous, stemming not only from better utilization of training hardware, but also from the value created when trained models are deployed in inference engines across millions of AI-powered devices.
In addition, says Ferro, designers can greatly mitigate the challenges of higher complexity with their choice of IP supplier.
“Integrated solutions such as the HBM2E memory interface from Rambus ease implementation and provide a complete memory interface sub-system consisting of co-verified PHY and digital controller,” he explains. “Further, Rambus has extensive experience in interposer design with silicon-proven HBM/HBM2 implementations benefiting from Rambus’ mixed-signal circuit design history, deep signal integrity/power integrity and process technology expertise, and system engineering capabilities.”
As Ferro observes, the progress of AI/ML has been breathtaking in recent years, and improvements to every aspect of computing hardware and software will be needed to keep this scorching pace on track.
“For memory, AI/ML training demands bandwidth, capacity and power efficiency all in a compact footprint. HBM2E memory, using a 2.5D architecture, answers AI/ML training’s call for ‘all of the above’ performance,” he concludes.