⛽ How the GPU Memory Hierarchy Fuels the Idea Behind FlashAttention



FlashAttention Part Three: Understanding the GPU memory hierarchy and how each component can be used to optimize performance.

Dimitris Poulopoulos

May 8th

GPU Memory Hierarchy
Faster than your company’s organization chart!

In our last chapter, we delved into the attention mechanism—today's superstar in the world of Deep Learning. We now have a basic understanding of how attention works, so, before we explore the various types of attention mechanisms, let's circle back to this month's topic: FlashAttention.

However, in this story we won’t be talking about FlashAttention directly. Instead, we’ll take a detour to understand a bit about GPU architecture and, specifically, the device’s memory hierarchy. Solidifying this theoretical framework is crucial for getting to grips with FlashAttention.

When you hear “hierarchy”, you might think of the army's rigid structure or the complex layers of a multinational company. However, when it comes to GPUs, we're dealing with a different kind of hierarchy altogether. Like most system devices, GPUs have a memory hierarchy determined by the distance between the memory chip and the computation unit—what we call cores in modern CPUs, or, in the GPU case, the SM (Streaming Multiprocessor).

Typically, the further you are from the computation unit, the slower the access speeds. On the other hand, the memory capacity increases and the cost is lower. In this chapter, we'll dissect the GPU memory hierarchy to show why understanding this is crucial for mastering FlashAttention. I’m certain that a few hints at the end will get you wondering: why has no one thought of FlashAttention before?

Let’s begin!


As we touched on in the prologue, GPUs—much like other systems—feature a memory hierarchy. We’ll start our exploration with the type of storage that sits closest to the execution unit. First stop on this silicon road-trip: the registers!

Local Storage

Each execution thread operates within its own logically isolated local storage space, where the most critical resource is the registers. These registers, embedded within the GPU, are high-speed storage units that act as both the starting point and the destination for nearly all low-level machine instructions. Essentially, when the GPU crunches numbers, it pulls its input from these registers and sends the output right back to them.

Registers are the fastest memory resource, but you, as a programmer, typically have no say in how they are used during execution; they are largely managed by the compiler. So, there’s not much optimization we can do at this level. Let’s move on to something we can influence, and should take advantage of.

Shared Memory (SRAM) / L1 Cache

Shared memory and the L1 cache are both integral parts of the GPU die; neither exists off-chip. Let’s start with the things you can control!

Shared memory is a local memory array, directly accessible and explicitly allocatable by programmers for their temporary data stash needs. Think of it as a box that can hold up to 48KB per thread block (maybe a bit more in modern architectures), and you can use it to deposit small items that you can’t hold in your hands, but also don’t want to move to the attic, because you will need them in a minute or two.

Shared memory is distinct in that it is a per SM (Streaming Multiprocessor) resource and possesses a dual nature:

  • It facilitates inter-thread communication within a block: Multiple workers (threads) may be working in a factory (SM). Everyone is tasked with doing the exact same job, using different resources (input data). However, each worker can share their work (output data) by placing the output of their effort in a common chest (shared memory), where other workers can pick it up.
  • It aims to reduce redundant global memory accesses and optimize corresponding data access patterns: Imagine you store every small item, even your forks and spoons, in the attic. Now, every time you’d like to eat, you’d have to open the attic door, climb up the ladder, open the box labeled “cutlery”, pick up a fork, go down, eat your meal, wash your fork, and follow the same process to store it back in the attic. It’s like a workout you didn’t sign up for. Since you will need the fork again in a few hours, it’s better to put it in the kitchen drawer. That’s what programmers do with shared memory. They use it to store data they want to use in the next operation (see the CUDA sketch right after this list).
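
To make the drawer analogy concrete, here is a minimal CUDA sketch (my own illustration, not code from the FlashAttention paper) of both points above: each thread deposits one value from global memory into a shared-memory tile, and the whole block then reuses those values without extra trips to HBM. The kernel name and sizes are placeholders, and a launch with 256 threads per block is assumed.

```cuda
#include <cuda_runtime.h>

// Assumes a launch configuration of 256 threads per block, e.g.:
//   blur1d<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
__global__ void blur1d(const float* in, float* out, int n) {
    // One shared-memory tile per thread block, with a one-element halo on each side.
    __shared__ float tile[256 + 2];

    int gid = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int lid = threadIdx.x + 1;                        // local index inside the tile

    if (gid < n) {
        tile[lid] = in[gid];                          // one global-memory (HBM) read per element
        if (threadIdx.x == 0 && gid > 0)
            tile[0] = in[gid - 1];                    // left halo
        if (threadIdx.x == blockDim.x - 1 && gid < n - 1)
            tile[lid + 1] = in[gid + 1];              // right halo
    }
    __syncthreads();  // make every thread's deposit visible to the rest of the block

    if (gid > 0 && gid < n - 1)
        // The neighbouring values come out of the shared-memory "drawer",
        // not from three separate trips to global memory.
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}
```

Each input element is fetched from global memory once and then consumed by up to three threads straight from the shared-memory tile: that is the redundant-access saving the second bullet describes.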

On the other hand, life gets a bit easier for programmers when dealing with the L1 cache, and that’s because the L1 cache is on auto-pilot! Unlike the hands-on approach needed for shared memory, the L1 cache operates behind the scenes. It automatically keeps a hold of recently used data, in the hope that it will be reused soon. It’s like having a smart assistant who remembers to keep your files on the desk because you’ll probably ask for them in a minute.

L2 Cache

The L2 cache is like the L1 cache’s bigger sibling with a broader responsibility. While the L1 cache sticks close to its home base in each SM (Streaming Multiprocessor), the L2 cache has the run of the entire device.

It's the central hub of data traffic—think of it as the main bus station in a city. All data must pass through this station before heading back to the vast countryside of global memory. This device-wide role ensures that if any piece of data needs to travel, it first checks in with the L2 cache, making sure nothing is needlessly repeated across the system.

Global Memory (HBM)

Global memory is like the distant warehouse of the GPU world—it's where everything starts when you're gearing up for computation. It is the first stop for any CUDA programmer needing to transfer data from the system memory to the GPU's processing powerhouse. Since it's accessible by all threads and the host, it's the communal pool everyone dips into.
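
As a quick illustration (standard CUDA runtime calls, with made-up sizes), this is roughly what that first stop looks like in host code: allocate a buffer in global memory, copy the data over from system memory, and copy the results back once the kernels are done.

```cuda
#include <cuda_runtime.h>
#include <vector>

int main() {
    const int n = 1 << 20;                              // illustrative size: 1M floats
    std::vector<float> host_data(n, 1.0f);              // lives in system memory

    float* device_data = nullptr;
    cudaMalloc((void**)&device_data, n * sizeof(float)); // allocate global memory (HBM)
    cudaMemcpy(device_data, host_data.data(), n * sizeof(float),
               cudaMemcpyHostToDevice);                  // system memory -> global memory

    // ... launch kernels that read and write device_data here ...

    cudaMemcpy(host_data.data(), device_data, n * sizeof(float),
               cudaMemcpyDeviceToHost);                  // ship the results back
    cudaFree(device_data);
    return 0;
}
```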

However, it's not without its drawbacks. Global memory has relatively high latency, meaning it’s a bit slow on the draw. Requesting data from this memory is like ordering a package online; you click 'order' and then wait patiently (or not so patiently) for it to arrive ready for use. This can be a snag for memory-bound algorithms that are as impatient as a kid on Christmas morning. But here's the silver lining: compared to the system's memory, global memory is still pretty speedy. It’s like getting a package from within your own country, as opposed to system memory's international shipping scenario, where your order has to clear customs.

Additionally, while global memory boasts higher throughput than your average system memory, it still moves like a snail compared to other memory types within the GPU. It's like having a fast car in a world where everyone else is driving supercars.

Why Should You Care?

Armed with this knowledge, you might be wondering, "Why should I care about all this if I just want to understand how FlashAttention works?" Great question! All the concepts we've discussed form the very core of FlashAttention. The distinctions between shared memory and global memory aren't just technical details; they're fundamental to grasping why FlashAttention is so much faster than the vanilla attention algorithm. If you've been following along, you're probably starting to see the bigger picture.

If you still don’t, I’m sure the following picture of how the vanilla attention algorithm flows will make everything clear:

Notice how each operation sequence starts with loading data from HBM and ends with the phrase “write back to HBM”? This loop is a significant bottleneck. As we discussed before, loading data from HBM is similar to ordering items online. Similarly, writing back to HBM can be compared to sending items back using the post office. Just as you might dread the delays and potential hassles of shipping items back and forth, the GPU faces efficiency challenges with these repetitive trips to and from the HBM.
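
To put rough numbers on those round trips, here is a tiny back-of-the-envelope sketch (my own, with illustrative sizes rather than figures from the paper) that counts the bytes each stage of vanilla attention moves through HBM:

```cuda
#include <cstdio>

int main() {
    // Illustrative assumptions: a single attention head, fp32 values,
    // sequence length n = 4096, head dimension d = 64, no batching.
    const double n = 4096, d = 64, b = sizeof(float);

    // Stage 1: S = Q K^T      (load Q and K from HBM, write the n x n matrix S back)
    double stage1 = (n * d + n * d + n * n) * b;
    // Stage 2: P = softmax(S) (load S from HBM again, write the n x n matrix P back)
    double stage2 = (n * n + n * n) * b;
    // Stage 3: O = P V        (load P and V from HBM, write the n x d output O back)
    double stage3 = (n * n + n * d + n * d) * b;

    printf("HBM traffic: QK^T ~%.0f MB, softmax ~%.0f MB, PV ~%.0f MB\n",
           stage1 / 1e6, stage2 / 1e6, stage3 / 1e6);
    return 0;
}
```

The two n x n intermediates dominate the traffic: they are written out to HBM only to be read straight back in by the next stage, which is exactly the round trip FlashAttention sets out to avoid.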

So, what can we do? This will be the focus of our next—and final—story on FlashAttention. But if you’ve been paying close attention, you might already have a hint of what’s coming!

