As the demand for high computing performance in artificial intelligence grows, parallel processing accelerators such as GPUs and TPUs have become the key determinant of system performance. Following this trend, the number of cores in accelerators keeps increasing for performance scaling, which in turn requires more off-chip memory bandwidth and more area for the added cores. As a result, the energy consumed by interconnection increases, and insufficient off-chip memory bandwidth limits system performance. To overcome this limitation, the Processing-In-Memory (PIM) architecture has re-emerged. PIM integrates processing units with memory, and it can be implemented in 3D-stacked memory such as high bandwidth memory (HBM) without the drawbacks that arise from the differences between memory and logic processes.
Our lab's AI hardware group focuses on the optimized design of the PIM architecture and interconnection for 3D-stacked PIM-HBM, considering signal integrity (SI) and power integrity (PI). To provide high memory bandwidth to the PIM cores through through-silicon vias (TSVs), either the data rate or the number of TSVs must be increased. However, TSVs already occupy more than 30% of the DRAM area, and the TSV data rate is limited by channel performance. Therefore, SI-aware optimization of the TSV channel design and placement is essential for achieving small area and high bandwidth. In addition, an appropriate number of PIM cores must be embedded in the HBM logic die to improve system performance. As the number of PIM cores increases for higher performance, more logic die area is required, and the resulting longer interposer channel reduces the memory bandwidth available to the host processor. Consequently, the PIM-HBM logic die and the interposer channel must be co-optimized to improve system performance without degrading the interposer bandwidth seen by the host processor.
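The TSV bandwidth trade-off above can be illustrated with simple arithmetic. The sketch below is not from the source; it assumes HBM2-like figures (a 1024-bit data interface at 2 Gb/s per pin, giving 256 GB/s per stack) purely to show how TSV count and per-TSV data rate each scale aggregate bandwidth, with different costs:

```python
def tsv_bandwidth_gbytes(num_tsvs: int, data_rate_gbps: float) -> float:
    """Aggregate TSV bandwidth in GB/s: lane count times per-lane rate, bits to bytes."""
    return num_tsvs * data_rate_gbps / 8.0

# HBM2-like baseline (assumed, illustrative): 1024 data TSVs at 2 Gb/s each.
base = tsv_bandwidth_gbytes(1024, 2.0)        # 256.0 GB/s

# Two ways to double internal bandwidth for PIM cores:
more_tsvs  = tsv_bandwidth_gbytes(2048, 2.0)  # more TSVs  -> costs extra DRAM area
faster_tsv = tsv_bandwidth_gbytes(1024, 4.0)  # faster TSVs -> demands better SI / channel design
```

Both paths reach 512 GB/s, but the first pays in DRAM area (already over 30% TSV-occupied) while the second pays in channel performance, which is why SI-aware TSV channel design and placement sit at the center of the optimization.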
Through the system-level optimization described above, our PIM-HBM architecture achieves high energy efficiency by drastically shortening interconnection lengths, and improves system performance in memory-bound applications by increasing the internal TSV bandwidth.