Overview

Most of the power consumed by current GPGPU-based systems goes to data transfer. This is a consequence of their asynchronous, best-effort, bus-master, many-core architecture, which has remained essentially unchanged for the past 30 years. To tackle this problem, we have developed Lenzo Core, based on a completely new architecture called CGLA (Coarse Grained Linear Array).

The key features of CGLA:

  • A dataflow-oriented architecture that combines synchronous QoS control, a bus-slave design, and a coarse-grained reconfigurable structure to minimize data transfers between compute cores and memory (a minimal sketch of this idea follows the list).
  • A computing platform that addresses the challenges of large-scale SIMD and wide-memory-bus architectures (including GPGPUs), which are difficult to partition into chiplets. Instead, CGLA uses a multi-level pipeline and a tightly integrated configuration of narrow memory buses.
  • An architecture that overcomes the limitations of traditional Coarse Grained Reconfigurable Arrays (CGRAs), which, despite being expected to deliver high efficiency as non-Von Neumann architectures, struggle with programming complexity and slow compilation.
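
As a purely illustrative sketch of the first point, the following C fragment contrasts two element-wise kernels that round-trip an intermediate buffer through memory with a chained pipeline in which the intermediate value never leaves the array. The function names and the software modeling of pipeline stages are our own assumptions for explanation, not part of the CGLA interface.

    #include <stddef.h>

    /* Separate kernels, GPGPU style: the intermediate buffer 'tmp' is written
     * to memory by the first kernel and read back by the second. */
    void scale_then_relu_two_kernels(const float *x, float *tmp, float *y,
                                     float w, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            tmp[i] = w * x[i];                       /* store tmp to memory  */
        for (size_t i = 0; i < n; i++)
            y[i] = tmp[i] > 0.0f ? tmp[i] : 0.0f;    /* reload tmp from memory */
    }

    /* Chained pipeline stages, dataflow style: the scaled value flows directly
     * from stage 1 to stage 2 inside the array, so only x and y touch memory. */
    void scale_then_relu_pipeline(const float *x, float *y, float w, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            float t = w * x[i];           /* stage 1 output held in a pipeline register */
            y[i] = t > 0.0f ? t : 0.0f;   /* stage 2 consumes it in the next cycle */
        }
    }

In hardware, the second form would correspond to configuring two adjacent stages of the linear array once and then streaming data through them, which is where the reduction in core-to-memory traffic comes from.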

Background

Chiplet-based accelerators are an effective way to reduce manufacturing energy consumption and costs by improving yield. However, when a traditional computing architecture is split into chiplets connected by high-speed serial interfaces, energy consumption during operation rises significantly, which becomes a major issue. Chiplet-based design should not only aim to improve yield but also contribute to energy efficiency. Achieving this requires innovative and scalable ideas at the architectural level.

The currently dominant architectures, such as many-core CPUs and memory-coalescing GPGPUs, are based on the Von Neumann model. In these systems, the compute cores act as bus masters, taking the initiative to fetch data from main memory through the cache on a best-effort basis. These architectures have historically adopted dynamic instruction execution to preserve program compatibility, even at the expense of power efficiency. As a result, wide memory buses have become essential to their performance, making them ill-suited to chiplet-based designs.

However, with a deep understanding of the architectural hierarchy, this trade-off between program compatibility and power efficiency can be exploited. For instance, adopting a static instruction execution model for chiplet-friendly architectures, as shown in the figure above, in which the compute cores act as bus slaves, opens up new possibilities inspired by past designs such as low-power VLIW and high-efficiency vector architectures. By retaining the programming model of Von Neumann computers while switching to a non-Von Neumann execution model, the traditional obstacles of non-Von Neumann systems, namely complex programming and long compilation times, can be overcome while dramatically improving power efficiency.
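
The sketch below, again in illustrative C rather than actual CGLA code, contrasts the two execution models for a simple SAXPY-like operation: in the dynamic, bus-master model the core fetches its own operands, while in the static, bus-slave model the operation is fixed by a configuration written once before the run and data is streamed into passive pipeline stages. The configuration structure and the streaming front end are assumptions made for this example.

    #include <stddef.h>

    /* Dynamic, bus-master execution: on every iteration the core issues its own
     * loads through the cache on a best-effort basis, computes, and stores. */
    void saxpy_bus_master(const float *x, const float *y, float *out,
                          float a, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            float xi = x[i];         /* demand load, possible cache miss */
            float yi = y[i];         /* another demand load */
            out[i] = a * xi + yi;    /* compute, then demand store */
        }
    }

    /* Static, bus-slave execution: the operation is fixed by a configuration
     * written once before the run; the cores are passive pipeline stages and a
     * streaming front end (modeled by the loop below) pushes operands to them. */
    typedef struct {
        float a;                     /* constant baked into the configuration */
    } saxpy_config_t;

    static inline float saxpy_stage(const saxpy_config_t *cfg, float xi, float yi)
    {
        return cfg->a * xi + yi;     /* one statically scheduled stage */
    }

    void saxpy_dataflow(const saxpy_config_t *cfg,
                        const float *x_stream, const float *y_stream,
                        float *out_stream, size_t n)
    {
        /* In hardware this loop is the DMA-like front end, not the compute
         * cores: the stages transform whatever arrives each cycle. */
        for (size_t i = 0; i < n; i++)
            out_stream[i] = saxpy_stage(cfg, x_stream[i], y_stream[i]);
    }

Because the schedule is fixed at configuration time, the source program can still be written in the familiar load-compute-store style, while the hardware avoids the per-instruction fetch and bus arbitration costs of a bus-master design.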

Competitive Advantage

Although many low-power AI chips are available on the market, most focus on designs that combine specific classical AI algorithms with low-precision computation. As a result, they often become obsolete within just a few months. The ongoing shift from deep convolutional neural networks to transformer-based large language models is revealing a critical issue: low-precision computation is poorly suited to large-scale AI.

Emerging technologies such as photonics, quantum computing, reservoir computing, memristors, in-memory computing, and novel materials often lack the flexibility and scalability needed to accommodate evolving algorithms and precision requirements. As a result, GPGPUs, which offer high-precision, general-purpose, large-scale computing, remain in wide use.

From the end user's perspective, it is crucial that a computing platform be able to execute algorithms originally developed for GPGPUs without compromising computational precision. Moreover, because power-efficient platforms often come with increased programming complexity, programmability on par with GPGPUs is also essential.

The figure above presents the broader landscape of AI chip architectures and highlights where CGLA fits within it. The decision flow, starting from the left, considers factors such as programmability, compilation speed, architectural model (Von Neumann vs. non-Von Neumann), and memory interface, ultimately leading to the major architectures shown on the right. CGLA stands out with its distinctive combination of rapid compilation, non-Von Neumann architecture, and a ring-based structure integrated with cache memory. Notably, LAPP and EMAX serve as predecessors to CGLA.