FPGAs take on convolutional neural networks

This entry was posted on Monday, May 8th, 2017.

In the context of machine learning, a convolutional neural network (CNN, or ConvNet) can perhaps best be defined as a category of feed-forward artificial neural network in which the connectivity pattern between neurons is inspired by the organization of the animal visual cortex. According to Stanford teaching staff, convolutional neural networks are quite similar to ordinary neural networks, as they are made up of neurons that have learnable weights and biases.

“Each neuron receives some inputs, performs a dot product and optionally follows it with a non-linearity,” Stanford teaching staff stated in a series of notes posted to GitHub. “The whole network still expresses a single differentiable score function: from the raw image pixels on one end to class scores at the other. [However], ConvNet architectures make the explicit assumption that the inputs are images, which allows us to encode certain properties into the architecture. These then make the forward function more efficient to implement and vastly reduce the amount of parameters in the network.”
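The neuron described in the Stanford notes can be sketched in a few lines of Python. This is only an illustration: the function names, the choice of ReLU as the non-linearity, and the 32x32x3 image and 5x5 filter sizes are assumptions made for the example, not details from the article. The second half compares how many parameters a single fully connected neuron needs against a single convolutional filter, which is the kind of reduction the quote refers to.

```python
def relu(x):
    """A common non-linearity: pass positive values through, clamp negatives to zero."""
    return max(0.0, x)


def neuron(inputs, weights, bias, activation=relu):
    """Dot product of inputs and weights, plus a bias, followed by a non-linearity."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)


def fully_connected_params(height, width, channels):
    """Parameters for ONE fully connected neuron that sees every pixel (weights + bias)."""
    return height * width * channels + 1


def conv_filter_params(kernel_size, channels):
    """Parameters for ONE convolutional filter shared across the whole image (weights + bias)."""
    return kernel_size * kernel_size * channels + 1


# 1.0*0.5 + 2.0*(-0.25) + 3.0*0.5 + 0.5 = 2.0, and ReLU leaves it unchanged.
print(neuron([1.0, 2.0, 3.0], [0.5, -0.25, 0.5], bias=0.5))  # 2.0

# Assumed sizes: a 32x32 RGB image and a 5x5 filter.
print(fully_connected_params(32, 32, 3))  # 3073
print(conv_filter_params(5, 3))           # 76
```

Because the filter's weights are reused at every position in the image, the parameter count depends on the filter size rather than the image size.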

As Nicole Hemsoth of The Next Platform reports, Dr. Peter Milder of Stony Brook University and his team have developed an FPGA-based architecture, dubbed Escher, to tackle convolutional neural networks in a way that minimizes redundant data movement and makes the most of an FPGA's on-chip buffers and innate flexibility to bolster inference.

“The big problem is that when you compute a large, modern CNN and are doing inference, you have to bring in a lot of weights—all these pre-trained parameters. They are often hundreds of megabytes each, so you can’t store them on chip—it has to be in off-chip DRAM,” Milder told The Next Platform. “In image recognition, you have to bring that data in by reading 400-500 MB just to get the weights and answer, then move on to the next image and read those same hundreds of megabytes again, which is an obvious inefficiency.”

The goal, says Milder, is to create an architecture that is flexible enough to handle whatever layer users want to compute.

“What we did with Escher was to produce an accelerator for CNN layers that is flexible enough to work on fully connected and the convolutional layers themselves—and can have batching applied to all of them without the overhead.”
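To see why batching matters for the weight traffic Milder describes, consider a rough back-of-the-envelope sketch. It assumes a deliberately simplified model: the full set of weights is read from off-chip DRAM once per batch and reused for every image in that batch, while activation traffic and on-chip buffer limits are ignored. The function name and this model are illustrative assumptions, not a description of how Escher itself schedules data; the 500 MB figure is taken from the upper end of the range in Milder's quote.

```python
def weight_traffic_per_image_mb(weight_mb, batch_size):
    """Off-chip weight reads per image, assuming weights are fetched once per batch.

    Simplified model: ignores activations, buffer capacity, and partial reuse.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return weight_mb / batch_size


WEIGHT_MB = 500.0  # upper end of the 400-500 MB range quoted in the article

for batch_size in (1, 4, 16):
    per_image = weight_traffic_per_image_mb(WEIGHT_MB, batch_size)
    print(f"batch of {batch_size}: {per_image} MB of weights read per image")
```

Under these assumptions, a batch of 16 images cuts the weight reads from 500 MB per image to 31.25 MB per image, since the same weights serve the whole batch.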

According to Milder, the current interest in FPGAs is staggering.

“Just a few years ago, there would only be a few people working on these problems and presenting at conferences, but now there are many sessions on topics like this. People are seriously looking at FPGA deployments at scale now,” he explained. “The infrastructure is in place for a lot of work to keep scaling this up. The raw parallelism with thousands of arithmetic units that can work in parallel, connected to relatively fast on-chip memory to feed them data means the potential for a lot of energy efficient compute at high performance values for deep learning and other workloads.”

Commenting on the above, Steven Woo, VP of Systems and Solutions in the Office of the CTO at Rambus, told Rambus Press that hardware acceleration using FPGAs and specialized silicon is on the rise for applications like machine learning, because these accelerators offer tremendous improvements in performance and power efficiency.

“Moving data between processing engines and off-chip memory significantly impacts power and performance, and remains a bottleneck for many applications. Batching and tiling techniques like those used in Escher and Google’s TPU help to reduce the impact of data movement, but as problem sizes and network sizes grow, data movement will continue to be a critical issue,” he added.