Narrow roads versus silicon superhighways
Steven Woo, Rambus fellow and distinguished inventor, recently spoke with Ed Sperling of Semiconductor Engineering about the capabilities of HBM2 and GDDR6 memory in real-world scenarios. As Woo notes, choosing between HBM2 and GDDR6 is a complex design decision that requires an in-depth understanding of system and application requirements. For example, certain applications need a significant amount of bandwidth, while others demand very low latency. To illustrate the differences between the two memory types, as well as how they compare to lower-bandwidth memories like DDR and LPDDR, Woo uses the analogy of a car driving on a road.
“[Imagine] you are traveling on the road from your home. Let’s say you wanted to travel to the next city. The amount of time it takes you – the latency to travel to the next city – depends on how much traffic there is on the road and how much traffic each of those roads can support,” he explains.
“So, I have two choices: I can take a narrow road or a wide superhighway. Well, in the dead of night when there’s no traffic it’ll take me about the same amount of time to get from point A to point B because there’s nothing really stopping me. So, the amount of time it takes me will be roughly the same.”
However, says Woo, this obviously isn’t the case when there is traffic.
“If I’m by myself on the road, I can take whatever road is closer to my home and that’ll be the fastest route for me to get from point A to point B. But when there is a lot of traffic it’s a different story. A road that can’t support a lot of traffic has low bandwidth. It is more problematic and takes me a lot longer during rush hour to get from point A to point B.”
As Woo emphasizes, memory systems are really no different from the roads described above. Indeed, from a silicon perspective, traffic can originate from multiple sources.
“Different cars represent accesses to memory from the various [items] that could be on the SoC or in the system. It could be graphics, it could be I/O, or the central processor (CPU) itself. However, sometimes the application needs very little latency because you must wait until you get that data. And if you’re stuck behind a lot of traffic your wait time can go up,” he elaborates. “But contrast that with something that’s much higher bandwidth, something that supports a lot more traffic. If we had those same cars and trucks, it would be much easier to have a fast path to get from point A to point B because you can support much more traffic on this road.”
HBM2 & GDDR6
The latter example, says Woo, is analogous to what newer memories such as HBM2 and GDDR6 enable.
“You can get from point A to point B much faster when you have a lot of load. So, applications like graphics and artificial intelligence (AI) – they have a lot of traffic that they’re trying to support. This means the latency tends to stay low even under the heaviest amounts of traffic.”
Continuing the road traffic analogy, Woo discusses the speed per lane for applications such as graphics and AI.
“The speed per lane [is typically] the same, no matter which one you’re using. However, sometimes interference occurs. This can be thought of as drivers on the highway who like to change lanes from time to time,” he states. “The more people that are frequently changing lanes, the more people can slow you down. Some analogous things happen in memory systems where the various requests that originate from graphics and the CPU and I/O might interfere with each other. So, if you don’t have enough lanes it can be hard for you to route your work quickly through the system.”
On a single- or double-lane road, a car that swerves back and forth blocks all the traffic behind it. A multi-lane highway, however, can better absorb a car or truck that keeps weaving between lanes.
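The lane-change analogy can be made slightly more concrete with a toy probability model. The sketch below assumes, purely for illustration, that each in-flight request lands on one of the available channels (lanes) uniformly at random and independently. Real memory controllers map addresses to channels with interleaving schemes, so treat the numbers as qualitative. The function name `expected_conflicts` is hypothetical.

```python
def expected_conflicts(requests: int, lanes: int) -> float:
    """Expected number of requests that must wait behind another request.

    Toy model (an assumption, not a real controller): each request picks
    one of `lanes` channels uniformly at random and independently.
    """
    if requests < 0 or lanes < 1:
        raise ValueError("need requests >= 0 and lanes >= 1")
    # Expected number of distinct lanes that receive at least one request.
    busy_lanes = lanes * (1 - (1 - 1 / lanes) ** requests)
    # Every request beyond the first one in a lane has to queue behind it.
    return requests - busy_lanes


for lanes in (2, 8, 32):
    print(f"{lanes:2d} lanes: {expected_conflicts(8, lanes):.2f} queued requests")
```

With eight requests in flight, the expected number that end up stuck behind another request falls steadily as the lane count grows from 2 to 8 to 32, which mirrors the benefit of extra lanes that Woo describes.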
“Of course, it is more difficult to build the bigger highway than the smaller one, as it takes more room and consumes more resources,” Woo explains. “Nevertheless, there are benefits depending on workload. So, if you have a workload where you absolutely need the lowest latency and there isn’t much traffic on the system, you can build a narrower system – something like what you would see with DDR or LPDDR.”
AI & graphics
However, applications and tasks like AI or graphics rendering are bandwidth intensive and require a significantly wider system to support an increased amount of traffic.
“You can dedicate lanes and some [designers] do just that for their systems. They may take some lanes and some resources in the memory system and say these things are reserved for CPU traffic and that would be analogous to a carpool lane on a highway,” says Woo. “[Regarding] AI chips, you’re trying to get massive throughput; you’ve got multiple processors and you’ve got multiple onboard memories on chip, but you also have lots of external I/Os into different kinds of memory.”
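A dedicated “carpool lane” can be sketched as a channel-selection rule. This is a hypothetical illustration, not a description of any specific controller. It assumes the first `reserved_for_cpu` channels are set aside for CPU requests, that all other sources share the remaining channels, and that requests are spread within a group by simple interleaving on 64-byte lines. The names `Request` and `pick_channel` are made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    source: str   # e.g. "cpu", "gpu", "io"
    address: int  # byte address


def pick_channel(req: Request, num_channels: int, reserved_for_cpu: int) -> int:
    """Map a request to a channel index (hypothetical policy).

    Channels [0, reserved_for_cpu) are reserved for CPU traffic; all other
    sources share channels [reserved_for_cpu, num_channels). Within each
    group, consecutive 64-byte lines rotate across the group's channels.
    """
    if not 0 < reserved_for_cpu < num_channels:
        raise ValueError("need at least one reserved and one shared channel")
    line = req.address // 64
    if req.source == "cpu":
        return line % reserved_for_cpu
    shared = num_channels - reserved_for_cpu
    return reserved_for_cpu + line % shared


print(pick_channel(Request("cpu", 64), num_channels=8, reserved_for_cpu=2))
print(pick_channel(Request("gpu", 64), num_channels=8, reserved_for_cpu=2))
```

With 8 channels and 2 reserved, a burst of GPU or I/O traffic can occupy at most channels 2 through 7, so it never competes with CPU requests on channels 0 and 1. The trade-off is that the reserved lanes sit idle whenever the CPU is quiet.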
To be sure, high-performance memory like GDDR or HBM2 can be paired with lower-bandwidth memory such as conventional DDR, so that a single system has the right kind of memory for each type of traffic.
“Very high-performance traffic [is generated by] AI systems where the user is moving a lot of weights and a lot of training samples back and forth to the engine,” he elaborates. “You tend to move those in bulk very quickly and you’ll tend to use the high-performance memory. With the rest of the system, you tend to use things like more traditional DDR where you don’t need as high a performance.”
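The split Woo describes, with bulk data in high-performance memory and everything else in DDR, can be sketched as a simple first-fit placement pass. The `Buffer` type, the `bulk_streaming` flag, and the first-fit policy are all assumptions made for illustration; real systems weigh access patterns, capacity, and cost in far more detail.

```python
from dataclasses import dataclass


@dataclass
class Buffer:
    name: str
    size_gib: float
    bulk_streaming: bool  # moved in bulk, e.g. weights or training samples


def place(buffers: list[Buffer], fast_capacity_gib: float) -> dict[str, str]:
    """First-fit placement: bulk-streamed buffers go to the high-bandwidth
    tier (HBM2 or GDDR6) while capacity lasts; everything else goes to DDR."""
    placement: dict[str, str] = {}
    remaining = fast_capacity_gib
    for buf in buffers:
        if buf.bulk_streaming and buf.size_gib <= remaining:
            placement[buf.name] = "fast"  # HBM2 / GDDR6 tier
            remaining -= buf.size_gib
        else:
            placement[buf.name] = "ddr"   # conventional DDR tier
    return placement


buffers = [
    Buffer("weights", 6, True),
    Buffer("samples", 4, True),
    Buffer("os_and_app", 2, False),
]
print(place(buffers, fast_capacity_gib=8))
```

With 8 GiB of fast memory, the 6 GiB of weights land in the fast tier; the 4 GiB of training samples no longer fit, so they fall back to DDR along with the rest of the system’s data.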
The multi-tiered highways of HBM
Woo also describes stacked HBM2 as a “multi-tiered highway” where drivers can choose which level of the road to drive on. Put simply, stacking adds more levels of road, giving HBM more resources to support higher amounts of traffic. Nevertheless, it can be challenging for designers to implement a system with HBM2 memory effectively.
“[For example], yield is more difficult to achieve. Moreover, you must deal with the thermals, as heat gets trapped between the layers and you must find a way to pull it out. Of course, it’s also more challenging to try and design a multi-tiered highway, but if you can do it and if your system can tolerate all the environmental things like heat and shock and things like that, you can definitely get much better performance,” he adds.
This means designers need to think about what their applications will require.
Latency under load
“There’s a certain level of bandwidth and latency you’re going to need under that load. This concept [is known as] latency under load and you study this along with your applications to understand how to build the memory system,” states Woo. “It’s very easy to think about what the latency of a memory device looks like when there’s no load, or when there’s only a single request in the system.”
However, says Woo, real-world applications, especially demanding ones like graphics and AI, have lots of requests happening simultaneously.
“What you’re designing for is the bandwidth, say how many cars you can move through in a given amount of time – or how many requests you can support in a given amount of time. Also, what the latency under load is, because those are the things that really matter.”
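Latency under load can be illustrated with a textbook single-queue (M/M/1) approximation. This is a swapped-in modeling choice, not Woo’s method: it assumes random (Poisson) request arrivals, a fixed idle latency that is identical for both memories (the empty road at night), and 64-byte requests. The 25.6 GB/s and 256 GB/s figures correspond to one DDR4-3200 channel and one HBM2 stack at 2 Gb/s per pin; the 100 ns idle latency is an arbitrary placeholder.

```python
import math


def latency_under_load(load_gbs: float, bandwidth_gbs: float,
                       idle_latency_ns: float, request_bytes: int = 64) -> float:
    """Mean latency (ns) of one request under an M/M/1 queueing approximation.

    load_gbs:        bandwidth demanded by all traffic sources combined
    bandwidth_gbs:   peak bandwidth of the memory system (the road width)
    idle_latency_ns: latency of a lone request (the empty road at night)
    """
    if load_gbs >= bandwidth_gbs:
        return math.inf  # demand exceeds capacity: the queue grows without bound
    rho = load_gbs / bandwidth_gbs              # utilization of the "road"
    service_ns = request_bytes / bandwidth_gbs  # 1 GB/s == 1 byte/ns
    queueing_ns = rho * service_ns / (1 - rho)  # M/M/1 mean wait in queue
    return idle_latency_ns + queueing_ns


narrow = 25.6  # GB/s, one DDR4-3200 channel
wide = 256.0   # GB/s, one HBM2 stack at 2 Gb/s per pin
for load in (1.0, 10.0, 20.0, 25.0):
    print(f"{load:5.1f} GB/s  narrow={latency_under_load(load, narrow, 100):7.2f} ns"
          f"  wide={latency_under_load(load, wide, 100):7.2f} ns")
```

At 20 GB/s of demand the narrow system runs at about 78% utilization and its latency starts climbing, while the wide system is only about 8% busy and stays close to its idle latency. Push demand past 25.6 GB/s and the narrow system saturates entirely.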
Woo adds that processor choice is another important factor in deciding on a memory system type.
“Certain processors can tolerate slightly higher differences in latency better than others. So, as you send a request down to the memories and you wait for the data to come back, if that processor has other work it can tolerate the latency,” he explains. “There are differences in latency and your processor architecture tries to make up for that – and tries to make sure there’s enough work to do in the meantime.”
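One way to quantify how much work a processor needs “in the meantime” is Little’s law, a standard queueing result that Woo does not cite by name: the number of requests in flight equals throughput times latency. The sketch assumes 64-byte requests, and the function name `outstanding_requests_needed` is hypothetical.

```python
import math


def outstanding_requests_needed(bandwidth_gbs: float, latency_ns: float,
                                request_bytes: int = 64) -> int:
    """Little's law: requests in flight = throughput x latency.

    1 GB/s equals 1 byte/ns, so bandwidth_gbs * latency_ns is the number of
    bytes that must be in flight to sustain that bandwidth at that latency.
    """
    bytes_in_flight = bandwidth_gbs * latency_ns
    # round() trims float noise (e.g. from 25.6 * 100) before rounding up
    return math.ceil(round(bytes_in_flight / request_bytes, 6))


# Sustaining 256 GB/s with a 100 ns round trip needs 400 requests in flight.
print(outstanding_requests_needed(256, 100))
```

Doubling the latency doubles the required concurrency, which is why latency-tolerant processor architectures work hard to keep many requests outstanding.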
Moore’s Law and the evolution of system design
Commenting on Moore’s Law and the evolution of system design, Woo notes that electrical distances haven’t really changed over time. In other words, signals can’t travel faster than the speed of light, and the distance between the memory and the processor remains approximately the same, so latency can’t shrink much. Bandwidth, however, can be increased by adding more lanes.
“Historically, that’s exactly what we’ve done as an industry. We’ve come up with new memories that add more lanes and resources. By doing this, we [evolved] from a purely latency driven set of applications to applications that also demand bandwidth to compensate for the fact that latencies can’t get much lower than they really are.”
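The “add more lanes” strategy shows up directly in peak-bandwidth arithmetic: peak bandwidth is the number of data lanes times the per-lane data rate. The per-pin rates below are common published figures for each memory type, used here only as illustrative inputs, and `peak_bandwidth_gbs` is a hypothetical helper.

```python
def peak_bandwidth_gbs(data_pins: int, gbps_per_pin: float) -> float:
    """Peak bandwidth in GB/s: data lanes x per-lane rate (Gb/s) / 8 bits per byte."""
    return data_pins * gbps_per_pin / 8


# Illustrative configurations (typical per-pin rates, not upper limits):
print(peak_bandwidth_gbs(64, 3.2))    # one DDR4-3200 channel, 64 data bits
print(peak_bandwidth_gbs(32, 16.0))   # one GDDR6 device, 32 bits at 16 Gb/s
print(peak_bandwidth_gbs(1024, 2.0))  # one HBM2 stack, 1024 bits at 2 Gb/s
```

HBM2 reaches its bandwidth mostly through width (1,024 lanes at a modest per-pin rate), while GDDR6 relies more on per-lane speed over a narrower interface.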
Rambus, says Woo, has designed solutions for both HBM2 and GDDR6 memory systems, as the company sees a tremendous need for both memory types across a diverse range of applications and verticals.
“HBM is a great choice when you need the highest bandwidth and best power efficiency. [However], it’s much tougher to design with from an engineering standpoint – and it does take a little bit more cost to implement. So, if you don’t have the design experience or can’t tolerate the cost, GDDR6 is another great solution. It’s kind of a compromise, but still delivers great performance and a very wide pathway into memory to support high bandwidth,” he concludes.