How much do these silu and softmax improvements affect the LLM inference speed a...

terafo · on May 15, 2024

The overwhelming majority of FLOPs is indeed spent on matmuls, but softmax uses a disproportionate share of memory bandwidth, so it generally takes much longer than you'd expect from the FLOP count alone.

tehsauce · on May 15, 2024

If cpu softmax were limited by memory bandwidth, then these vectorization optimizations wouldn't improve performance.

cgearhart · on May 15, 2024

Why does it disproportionately use bandwidth?

jacobn · on May 16, 2024

In transformers the attention matrix is N*N (N being the sequence length), so there are a lot of values to go over. That typically makes the softmax memory-bandwidth bound, not compute bound.
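To put rough numbers on that, here is a back-of-the-envelope sketch for one attention head. The function name, the fp16 element size, and the figure of 5 FLOPs per softmax element are all assumptions made for illustration, not measurements, and the softmax is assumed to be unfused (it reads the N*N scores from memory and writes them back).

```python
def attention_score_costs(n, d, bytes_per_elem=2, softmax_flops_per_elem=5):
    """Rough per-head cost model for Q @ K^T followed by an unfused softmax.

    n: sequence length, d: head dimension.
    bytes_per_elem=2 assumes fp16; softmax_flops_per_elem=5 is a guess
    covering the subtract, exp, sum and divide per score.
    """
    scores = n * n  # entries in the attention matrix

    # Q @ K^T: one multiply and one add per (row, column, d) triple.
    matmul_flops = 2 * n * n * d
    # Read Q and K (n*d each), write the n*n scores.
    matmul_bytes = (2 * n * d + scores) * bytes_per_elem

    # Unfused softmax: read every score once and write every score once.
    softmax_flops = softmax_flops_per_elem * scores
    softmax_bytes = 2 * scores * bytes_per_elem

    return {
        "matmul_flops_per_byte": matmul_flops / matmul_bytes,
        "softmax_flops_per_byte": softmax_flops / softmax_bytes,
    }


costs = attention_score_costs(n=4096, d=128)
print(costs)
```

Under these assumptions the softmax does about 1.25 FLOPs per byte moved, while the matmul does on the order of a hundred, which is why the softmax can be limited by bandwidth even though it accounts for a small share of the FLOPs.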

cgearhart · on May 16, 2024

Oooooh, I forgot that the self attention layer has a softmax. I thought this was referring to a softmax on the dense forward layer. Thanks!

Next question: is it the softmax in the self-attention block that makes it bandwidth bound? Won't it have to materialize all the entries of the N^2 matrix either way? Does the softmax cause redundant data reads?

bjornsing · on May 16, 2024

Wouldn’t the softmax typically be “fused” with the matmul though?

anewhnaccount2 · on May 16, 2024

Yes, but as far as I understand this is only really usefully possible with FlashAttention. (The main idea is that you have to use the log-sum-exp trick when computing the softmax, but you don't know the final max activation until you've seen the whole row, so the partial results have to be rescaled whenever the running max changes.)
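The rescaling idea can be sketched in plain Python. This is only an illustration of a streaming ("online") softmax over one row, not FlashAttention's actual kernel. The function names and the block size are made up for the example.

```python
import math


def softmax_two_pass(xs):
    """Reference version: needs the whole row up front to find the max."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def online_softmax_stats(xs, block_size=2):
    """Stream over xs in blocks, keeping a running max and a running sum.

    Whenever a block raises the max, the sum accumulated so far is
    rescaled by exp(old_max - new_max) so it stays consistent.
    """
    running_max = float("-inf")
    running_sum = 0.0
    for start in range(0, len(xs), block_size):
        block = xs[start:start + block_size]
        new_max = max(running_max, max(block))
        # Rescale the old sum to the new max, then add this block's terms.
        running_sum = running_sum * math.exp(running_max - new_max) + sum(
            math.exp(x - new_max) for x in block
        )
        running_max = new_max
    return running_max, running_sum


def softmax_online(xs, block_size=2):
    """Softmax computed from the streamed max and sum."""
    m, s = online_softmax_stats(xs, block_size)
    return [math.exp(x - m) / s for x in xs]


row = [1.0, 3.0, 2.0, 5.0, 4.0]
reference = softmax_two_pass(row)
streamed = softmax_online(row)
print(max(abs(a - b) for a, b in zip(reference, streamed)))
```

The two versions agree up to floating-point rounding. The point of the streaming form is that each block of scores can be consumed as soon as the matmul produces it, so the full N*N matrix never has to be written out and read back.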