
How much do these silu and softmax improvements affect the LLM inference speed as a whole? Correct me if I'm wrong but I feel that this change will only have a small effect as the majority of the time is spent doing matrix multiplications.


The overwhelming majority of FLOPs is indeed spent on matmuls, but softmax uses a disproportionate share of memory bandwidth, so it generally takes much longer than you'd expect from the FLOP count alone.


If CPU softmax were limited by memory bandwidth, then these vectorization optimizations wouldn't improve performance.


Why does it disproportionately use bandwidth?


In transformers the attention matrix is N*N (N being the sequence length), so there are a lot of values to go over. That typically makes it memory-bandwidth bound, not compute bound.
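To put rough numbers on that, here is a back-of-envelope sketch. The function name, the fp16 element size, and the figure of about 5 FLOPs per element for softmax are all assumptions for illustration, not measurements, and it assumes the score matrix is read once and written once from memory.

```python
def attention_score_stats(n, head_dim, bytes_per_elem=2, softmax_flops_per_elem=5):
    """Rough per-head cost estimate for an N x N attention score matrix.

    All constants are illustrative assumptions:
    - bytes_per_elem=2 assumes fp16 scores
    - softmax_flops_per_elem=5 is a guess covering subtract, exp, add, divide
    """
    elems = n * n
    # QK^T: each of the N*N scores is a dot product of length head_dim
    matmul_flops = 2 * elems * head_dim
    softmax_flops = softmax_flops_per_elem * elems
    # Assume softmax reads every score once and writes every probability once
    softmax_bytes = 2 * elems * bytes_per_elem
    return {
        "score_matrix_bytes": elems * bytes_per_elem,
        "matmul_flops": matmul_flops,
        "softmax_flops": softmax_flops,
        "softmax_bytes_moved": softmax_bytes,
        "softmax_flops_per_byte": softmax_flops / softmax_bytes,
    }


stats = attention_score_stats(n=4096, head_dim=128)
print(stats)
```

Under these assumptions the softmax does only about 2% of the FLOPs of the matmul that produced the scores, yet it still touches the whole 32 MiB matrix and does around one FLOP per byte moved, which is why FLOP counts alone understate its cost.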


Oooooh, I forgot that the self attention layer has a softmax. I thought this was referring to a softmax on the dense forward layer. Thanks!

Next question: does the softmax in the self-attention block cause it to be bandwidth bound? Won't it have to materialize all the entries of the N^2 matrix either way? Does the softmax cause redundant data reads?


Wouldn’t the softmax typically be “fused” with the matmul though?


Yes, but as far as I understand this is only really usefully possible with FlashAttention. (The main idea is that you have to use the log-sum-exp trick when computing the softmax, but the max activation isn't known until you've seen every score, so when processing block by block you have to rescale the partial results each time the running max changes.)
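A minimal sketch of that rescaling idea in plain Python, processing one row of scores in blocks. This only illustrates the running max and running sum bookkeeping; it is not FlashAttention's actual kernel, and the function names and `block_size` parameter are made up for the example.

```python
import math


def softmax_two_pass(scores):
    """Reference softmax: needs the global max before any exp is computed."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def online_softmax_stats(scores, block_size=2):
    """Compute the max and the sum of exp(score - max) in a single blockwise pass."""
    running_max = float("-inf")
    running_sum = 0.0
    for start in range(0, len(scores), block_size):
        block = scores[start:start + block_size]
        new_max = max(running_max, max(block))
        # Rescale the sum accumulated so far to the new max, then add this block
        running_sum = running_sum * math.exp(running_max - new_max)
        running_sum += sum(math.exp(s - new_max) for s in block)
        running_max = new_max
    return running_max, running_sum


def online_softmax(scores, block_size=2):
    """Softmax built from the blockwise statistics above."""
    m, total = online_softmax_stats(scores, block_size)
    return [math.exp(s - m) / total for s in scores]


row = [1.0, 3.0, -2.0, 7.5, 0.0]
print(softmax_two_pass(row))
print(online_softmax(row, block_size=2))
```

The two functions should agree up to floating-point rounding. In a fused attention kernel the same rescaling factor would also be applied to the partially accumulated output, so the full N*N matrix never has to be written out.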



