Optimizing Memory for AI Applications: Part 1

Steven Woo, Rambus fellow and distinguished inventor, recently spoke with Ed Sperling of Semiconductor Engineering about how certain number formats can help system designers optimize memory bandwidth for artificial intelligence (AI) applications. As Woo observes, such optimization is necessary because the semiconductor industry is constantly struggling to keep up with an exponential increase of digital data.

“The amount of digital data we’re generating is growing dramatically – really far faster than almost any other technology can keep up. It’s become very difficult to keep all that data on chip and so external memory is [often used],” Woo explains.

“However, system designers would like to get even more bandwidth, although memory technology can only progress so quickly. This is why [a number of] interesting techniques have been developed to try and compensate for the difference between what the processing engines want and what the industry is able to provide.”

One example of an early optimization technique, says Woo, leveraged a 16-bit number format in an attempt to extend the capabilities of available memory bandwidth.

“Some years ago, the IEEE defined a popular format known as 32-bit floating-point numbers. These 32 bits are split up into three components: a sign bit, 8 bits for the exponent, which describes the range that the numbers cover, and a 23-bit mantissa, which is the fractional part of the number,” he elaborates. “When you have a fixed amount of memory bandwidth, instead of using large numbers that are 32 bits long, you can actually go to smaller size numbers [such as] 16-bit numbers. This has the effect of allowing you to pack twice as many numbers into the available bandwidth.”
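To make the layout Woo describes concrete, here is a minimal sketch that splits a value into its sign, exponent, and mantissa fields using only the Python standard library. The helper name `fp32_fields` is an illustrative assumption, not something from the interview.

```python
import struct

def fp32_fields(x: float) -> tuple[int, int, int]:
    """Round x to IEEE 754 single precision and return (sign, exponent, mantissa).

    Hypothetical helper for illustration. The exponent is returned in its
    stored (biased) form: the true power of two is exponent - 127.
    """
    # Reinterpret the 4 bytes of the float as an unsigned 32-bit integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits
    mantissa = bits & 0x7FFFFF        # 23 bits
    return sign, exponent, mantissa

# -6.5 = -1.625 * 2**2, so the stored exponent is 127 + 2 = 129
# and the mantissa encodes the fraction 0.625.
print(fp32_fields(-6.5))  # (1, 129, 5242880)
```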

Although 16-bit floating-point numbers still have a single sign bit, the number of bits in the exponent and the mantissa is reduced (to 5 and 10 bits, respectively, in the IEEE half-precision format).

“This means we’re less precise with our numbers – and we can cover a smaller range compared to 32-bit floating-point numbers,” he adds. “[However], what people began to realize, especially when it came to AI applications, was that the range you needed to cover was very important. As well, it was [slightly] less important to have exactly the type of precision that was available in 32-bit floating-point numbers.”
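The trade-off can be sketched with Python's `struct` module, whose `e` format code is IEEE half precision. This is only an illustration under that assumption; the helper names `to_fp16` and `fits_in_fp16` are hypothetical.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value (hypothetical helper)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def fits_in_fp16(x: float) -> bool:
    """Return True if x is within the finite range of half precision."""
    try:
        struct.pack("<e", x)
        return True
    except (OverflowError, struct.error):
        return False

# Bandwidth: half the bytes per value, so twice as many values per transfer.
bytes_fp32 = struct.calcsize("<f")  # 4
bytes_fp16 = struct.calcsize("<e")  # 2

# Precision: near 1.0, neighbouring FP16 values are 2**-10 apart,
# so a change of 0.0001 is lost entirely.
print(to_fp16(1.0001))        # 1.0

# Range: the largest finite FP16 value is 65504.
print(fits_in_fp16(65504.0))  # True
print(fits_in_fp16(1.0e5))    # False
```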

Consequently, a new type of numeric format known as bfloat16 (Brain Floating Point) was developed. Essentially, bfloat16 rebalances the 16 bits between the exponent and the mantissa.

“You have one sign bit, but to match the range of 32-bit floating-point numbers, we went back to 8 bits of exponent. This means the fractional part of the number (the mantissa) is now down to seven bits,” Woo continues. “Bfloat16 [represents] a nice compromise between trying to get you the range that you really need – for many types of applications – while not compromising the fractional precision so much that the networks fail to converge.”
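Because bfloat16 keeps the same sign and exponent fields as the 32-bit format, one simple way to illustrate it is to keep only the upper 16 bits of a 32-bit value. The sketch below does exactly that. It truncates for simplicity, whereas real hardware and libraries may round to nearest instead, and the function names are hypothetical.

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """Return the 16-bit bfloat16 pattern for x by truncating its FP32 encoding.

    Assumption: plain truncation (round toward zero) is used here for clarity.
    """
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 16  # keep sign (1) + exponent (8) + top 7 mantissa bits

def bf16_bits_to_float(b: int) -> float:
    """Expand a bfloat16 bit pattern back to a float by zero-filling the low 16 bits."""
    (x,) = struct.unpack(">f", struct.pack(">I", (b & 0xFFFF) << 16))
    return x

# Precision drops to 7 fractional bits...
print(bf16_bits_to_float(fp32_to_bf16_bits(3.14159)))  # 3.140625

# ...but the FP32 range survives: 1e38 is far beyond half precision's
# maximum of 65504, yet it is still finite in bfloat16.
big = bf16_bits_to_float(fp32_to_bf16_bits(1.0e38))
print(big > 9.9e37)  # True
```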

As Woo points out, system designers have found that neural networks are extremely tolerant of small changes in precision.

“Rebalancing of the bits has actually been helpful in terms of allowing the memory bandwidth to be used more effectively while not compromising on the performance of neural networks,” he states.

Woo also notes that very large neural networks – across multiple and diverse verticals – tend to require optimized precision and range.

“[Similarly], when you’re training, you’d like to have very high precision. You might actually train using a much higher precision set of numbers,” he explains. “When you go to inference you might actually reduce down to smaller size numbers so that you don’t need to burn as much power or need to spend as much memory bandwidth to get the kind of performance that you’re training for.”

In addition, Woo says that various types of inputs can leverage different dynamic ranges and number types.

“[As well], even the training process itself and the inference process can use different precision numbers at different parts of the computation. For example, you might do accumulations at a much higher resolution – maybe 32-bits – while you might do the multiply at something smaller like 16 bits,” he adds.
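A small sketch of why the accumulation is often kept at higher resolution: the dot product below rounds its inputs to 16 bits, then accumulates either at 32 bits or at 16 bits. It emulates the rounding in software with `struct` and is not a model of any particular accelerator. The names `round_to` and `dot` are hypothetical.

```python
import struct

def round_to(fmt: str, x: float) -> float:
    """Round x to the float format named by a struct code ('<e' = FP16, '<f' = FP32)."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]

def dot(a, b, acc_fmt: str) -> float:
    """Dot product with 16-bit operands and an accumulator stored in acc_fmt."""
    acc = 0.0
    for x, y in zip(a, b):
        # Multiply two 16-bit values; the product is exact in a Python float.
        product = round_to("<e", x) * round_to("<e", y)
        # Round the running sum to the accumulator's precision after each add.
        acc = round_to(acc_fmt, acc + product)
    return acc

ones = [1.0] * 4096
print(dot(ones, ones, "<f"))  # 4096.0  (32-bit accumulator)
print(dot(ones, ones, "<e"))  # 2048.0  (16-bit accumulator stalls)
```

With a 16-bit accumulator the sum stops growing at 2048, because 2049 is not representable in half precision and each further addition rounds back down.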
