Semiconductor Engineering Editor in Chief Ed Sperling recently spoke with Frank Ferro, Senior Director of Product Management at Rambus, about accelerating AI/ML applications in the data center with HBM3. Introduced by JEDEC in early 2022, the latest iteration of the high bandwidth memory standard increases the per-pin data rate to 6.4 Gigabits per second (Gb/s), double that of HBM2.
HBM3 maintains the 1024-bit wide interface of previous generations—while extending the track record of bandwidth performance set by what was originally dubbed the “slow and wide” HBM memory architecture. Since bandwidth is the product of data rate and interface width, 6.4 Gb/s x 1024 bits works out to 6,553.6 Gb/s. Dividing by 8 bits/byte yields a total bandwidth of 819.2 Gigabytes per second (GB/s).
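The per-device bandwidth figure above is simple arithmetic, sketched here for reference (the 6.4 Gb/s pin rate and 1024-bit interface width are the HBM3 numbers cited in this article):

```python
# HBM3 per-device bandwidth: per-pin data rate x interface width,
# then bits -> bytes.
DATA_RATE_GBPS = 6.4      # per-pin data rate, Gb/s (HBM3)
INTERFACE_WIDTH = 1024    # interface width, bits

raw_gbps = DATA_RATE_GBPS * INTERFACE_WIDTH   # 6553.6 Gb/s
bandwidth_gbytes = raw_gbps / 8               # 819.2 GB/s

print(f"{raw_gbps:.1f} Gb/s = {bandwidth_gbytes:.1f} GB/s")
```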
HBM3 also supports 3D DRAM devices of up to 12-high stacks—with provision for a future extension to as high as 16 devices per stack—for individual DRAM die densities of up to 32Gb. In real-world terms, a 12-high stack of 32Gb dies translates to a single HBM3 DRAM device of 48GB capacity. Moreover, HBM3 doubles the number of memory channels to 16 and supports 32 pseudo channels (two per channel). With more memory channels, HBM3 can support higher stacks of DRAM per device and finer access granularity.
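The capacity math follows directly from the stack height and die density quoted above (gigabits per die, divided by 8 to get gigabytes):

```python
# HBM3 stack capacity: die density (Gb) x stack height, converted to GB.
def stack_capacity_gb(die_density_gbit: float, stack_height: int) -> float:
    """Total stack capacity in gigabytes."""
    return die_density_gbit * stack_height / 8

print(stack_capacity_gb(32, 12))  # 48.0 GB — the 12-high case in the text
print(stack_capacity_gb(32, 16))  # 64.0 GB — the future 16-high extension
```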
Eliminating memory bandwidth bottlenecks
“HBM3 is all about bandwidth,” says Ferro. “There are many high-end accelerator cards going into the data center for AI [applications], particularly AI training. A lot of these systems have a good [number] of processors—but you’ve got to keep these processors fed [which means] memory bandwidth is now the bottleneck.”
To highlight IP requirements and potential design choices for the next generation of HBM3-based silicon, Ferro sketches a generic AI accelerator model with purpose-built processors running a neural network.
“You’ve got a processor—probably multiple processors—and these must get fed from memory. So, when you’re doing for example, image recognition training, you’ve got to put lots of data into the system [to enable high-accuracy inference],” he elaborates. “Clearly, you need a lot of memory bandwidth and that’s really where HBM3 comes into the picture. Although HBM2 and HBM2E [offer] very high bandwidth, processors still need to get fed with [even] more data.”
According to Ferro, memory is currently one of the most critical bottlenecks in the data center, especially for AI/ML applications.
“If you look at the data sets for AI, they’re just growing at exponential rates,” says Ferro. “Data increases from month-to-month and puts a lot of pressure on the memory side.”
Balancing price, performance, and power
As Ferro points out, requirements for specific workloads—such as image processing, financial modeling, and pharmaceutical simulations—play a major role in influencing the design of AI accelerators.
“In the picture above, I’m showing two HBM3 memory devices, a configuration that will provide 1.6 terabytes [per second] of performance. If you’re doing genome sequencing or financial transactions, you may need more—or less—bandwidth [depending on workload],” he explains. “So, you might add two more HBMs to double that bandwidth even further. We’ve even seen systems that go up to eight HBMs. The basic architecture [remains] the same, although you’re tuning the system from an optimization standpoint.”
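Since each HBM3 device contributes about 819.2 GB/s at the 6.4 Gb/s pin rate, the device counts Ferro mentions scale roughly as follows:

```python
# Aggregate bandwidth vs. HBM3 device count, using the per-device
# figure derived from the standard's headline numbers (6.4 Gb/s x 1024 / 8).
PER_DEVICE_GBYTES = 819.2  # GB/s per HBM3 device

for devices in (2, 4, 8):  # the configurations described in the article
    total_tb = devices * PER_DEVICE_GBYTES / 1000  # TB/s
    print(f"{devices} devices -> {total_tb:.1f} TB/s")
```

Two devices land at roughly the 1.6 TB/s figure cited; eight approach 6.6 TB/s.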
Additional design considerations include power and cost. As Ferro points out, HBM3 improves energy efficiency by dropping operating voltage to 1.1V and leveraging low-swing 0.4V signaling.
“You’re going to want to tune and balance the system to efficiently meet application [requirements] while staying within your cost and power budgets,” he adds.
To effectively determine tradeoffs that balance price, performance, and power, Ferro recommends that system designers first gauge memory processing requirements and then select an optimal implementation. For example, if a terabyte per second of bandwidth is sufficient, perhaps HBM2E memory devices will suffice. If the application demands more bandwidth, multiple HBM3 devices will likely be a better fit.
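That sizing exercise can be sketched as a small helper. This is a hypothetical illustration, not a tool from the interview; the HBM2E figure assumes the JEDEC 3.6 Gb/s pin rate (actual HBM2E implementations vary), computed the same way as the HBM3 number:

```python
import math

# Hypothetical sizing helper: how many devices of a given HBM generation
# are needed to hit a target bandwidth? Per-device GB/s figures assume
# pin rate x 1024 bits / 8 (3.6 Gb/s for HBM2E, 6.4 Gb/s for HBM3).
PER_DEVICE_GBYTES = {"HBM2E": 460.8, "HBM3": 819.2}

def devices_needed(target_gbytes: float, generation: str) -> int:
    return math.ceil(target_gbytes / PER_DEVICE_GBYTES[generation])

print(devices_needed(1600, "HBM3"))   # 2 devices for ~1.6 TB/s
print(devices_needed(1000, "HBM2E"))  # 3 devices for ~1 TB/s
```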
PCIe 6 and chiplets
As Ferro notes, PCIe will also play a major role in influencing future AI accelerator designs. Indeed, PCIe 5 offers a transfer rate of 32 gigatransfers per second (GT/s) per pin, while PCIe 6 will double this rate to 64 GT/s.
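For a sense of scale against the HBM numbers above, the raw per-direction link bandwidth at those transfer rates works out as follows for a common x16 link (this ignores encoding and protocol overhead—128b/130b for PCIe 5, FLIT/FEC for PCIe 6—so delivered throughput is somewhat lower):

```python
# Raw per-direction PCIe link bandwidth for an x16 link:
# GT/s per lane x lanes / 8 bits-per-byte -> GB/s, before encoding overhead.
LANES = 16

for gen, gt_per_s in (("PCIe 5", 32), ("PCIe 6", 64)):
    gbytes = gt_per_s * LANES / 8
    print(f"{gen} x{LANES}: ~{gbytes:.0f} GB/s per direction (raw)")
```

Even a raw 128 GB/s PCIe 6 x16 link sits well below a single HBM3 device’s 819 GB/s, which underscores why local high-bandwidth memory matters for keeping accelerators fed.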
“You’ve got to look at how much data you will be bringing into the system, how much data you’re bringing out, and how these processors need to get fed,” he elaborates. “For example, you can [potentially] partition some [workloads] dynamically, so if you decide to split it into multiple jobs—because a lot of this is happening in parallel—maybe you don’t [need to] use all of that bandwidth [or] processing power [for a single task], although you can do multiple things at once.”
According to Ferro, minimizing die size is also an important consideration, especially for HBM implementations. This is one reason the semiconductor industry is eyeing chiplets for AI accelerators, as the technology enables system designers to mix and match different components based on specific workload requirements, shrink overall die size, and reduce costs.
“[With chiplets], you can potentially go with a cheaper process node for the I/O controller, for example, but if you need the most advanced process node for your processor, you can [do so while] balancing overall system cost,” he adds.