The scaling of Large Language Models (LLMs) is increasingly constrained by the data-movement overhead between High-Bandwidth Memory (HBM) and on-chip SRAM. Specifically, the Key-Value (KV) cache size ...
Abstract: Modern processors use caches to reduce memory access time. However, their limited size leads to frequent misses, requiring an efficient replacement policy. The Least Recently Used (LRU) ...
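To make the LRU policy mentioned above concrete, the following is a minimal sketch in Python; the class name `LRUCache` and the `OrderedDict`-based design are illustrative choices, not drawn from the source.

```python
from collections import OrderedDict


class LRUCache:
    """Minimal LRU cache: evicts the least recently used key when full."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()  # insertion order doubles as recency order

    def get(self, key):
        if key not in self.data:
            return None  # miss
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used entry
```

For example, with capacity 2, inserting `a` and `b`, touching `a`, then inserting `c` evicts `b`, since `b` is the least recently used entry at that point.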