Multi-threaded computing drives up memory bandwidth demand but calls for smaller access granularity because of the random nature of its data accesses. Small data transfers, however, have become increasingly difficult with each DRAM generation. Although the memory interface has become faster, the frequency of the main memory core has remained relatively unchanged. As a result, DRAMs implement core prefetch, in which a larger amount of data is sensed from the memory core and then serialized onto the faster off-chip interface, effectively increasing the access granularity. This discrepancy between interface speed and core speed translates to a core-prefetch ratio of 8:1 in current DDR3 DRAMs and is forecast to reach 16:1 in future DRAMs. The larger prefetch ratio and transfer size can lead to computing inefficiency, especially for multi-threaded and graphics workloads that need higher access rates to smaller pieces of data.
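As a rough sketch of this relationship, the prefetch ratio is simply the interface transfer rate divided by the core clock frequency. The DDR3-1600 and 200 MHz figures below are illustrative assumptions, not values stated in the text:

```python
def prefetch_ratio(interface_mts: int, core_mhz: int) -> int:
    """Core-prefetch ratio: interface transfers per core clock cycle."""
    return interface_mts // core_mhz

# Illustrative: a DDR3-1600 part with an assumed 200 MHz core clock
print(prefetch_ratio(1600, 200))   # -> 8, the 8:1 ratio cited above
# Doubling the interface rate while the core stays put doubles the ratio
print(prefetch_ratio(3200, 200))   # -> 16
```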
The memory subsystem in today’s computing platforms is typically implemented with DIMMs that have a 64-bit-wide data bus and a 28-bit command/address/clock bus. On a standard DDR3 DIMM, all the devices within a module rank are accessed simultaneously with a single Command/Address (C/A). An example module configuration places eight x8 DDR3 components in parallel on a module printed circuit board, giving a minimum efficient data transfer of 64 bytes.
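The 64-byte figure follows directly from the module geometry; a minimal sketch, assuming DDR3's burst length of 8 (which mirrors the 8n core prefetch):

```python
devices_per_rank = 8   # eight x8 components accessed with one C/A
bits_per_device = 8    # x8 data width per component
burst_length = 8       # DDR3 BL8 (assumed; matches the 8n prefetch)

bus_width_bits = devices_per_rank * bits_per_device       # 64-bit data bus
min_transfer_bytes = bus_width_bits * burst_length // 8   # bits -> bytes
print(min_transfer_bytes)   # -> 64
```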
Greater efficiency on multi-threaded workloads with smaller transfers can be achieved by partitioning the module into two separate memory channels, multiplexing the commands across the same set of traces as a traditional module but with a separate chip select for each memory channel. In a threaded module, each side of the module is accessed independently, reducing the minimum transfer size to one-half that of a standard single-channel module.
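Under the same assumed geometry (x8 devices, burst length 8), halving the devices driven per access halves the minimum transfer:

```python
devices_per_channel = 4   # each channel drives half the module's devices
bits_per_device = 8       # x8 parts, as before
burst_length = 8          # DDR3 BL8 (assumed)

threaded_transfer = devices_per_channel * bits_per_device * burst_length // 8
print(threaded_transfer)  # -> 32 bytes, half the 64-byte single-channel transfer
```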
Threaded modules can also lower the power of main memory accesses. In a conventional eight-device module, all eight DRAMs are activated (ACT), followed by a read or write (COL) operation on all eight devices. A threaded, or dual-channel, module can accomplish the same data transfer by activating only four devices and then performing two consecutive read or write operations on those devices. Since only four devices are activated per access instead of eight, a threaded module achieves equivalent bandwidth with one-half the device row-activation power. At the memory-system level, this translates to approximately a 20% reduction in total module power.
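The operation counts above can be put into a simple energy tally. The per-operation weights here are assumed purely for illustration; only the ACT and COL counts come from the text:

```python
# Per-device energy units; absolute values are assumed, not from the text.
E_ACT = 1.0   # one row activation
E_COL = 0.5   # one column (read/write) burst

# Energy for one equivalent data transfer:
conventional = 8 * (E_ACT + 1 * E_COL)   # 8 devices, 1 COL op each
threaded     = 4 * (E_ACT + 2 * E_COL)   # 4 devices, 2 COL ops each

# Activation energy is halved (8 ACTs -> 4 ACTs) while total column energy
# is unchanged (8 COL ops either way), so threaded comes out lower.
print(conventional, threaded)   # 12.0 vs 8.0 with these assumed weights
```

How that activation saving maps onto the roughly 20% total-module figure depends on what fraction of overall module power row activation represents; the weights here are not calibrated to any real part.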