There are a lot of matrix multiplication algorithms out there with a lot of plus...

emacs28 · on March 15, 2024

For everyone discussing the reduced accuracy/numerical stability of the algorithms in floating-point, this is true. But note that the application of the algorithms in the work is explored for fixed-point MM/quantized integer NN inference, not floating-point MM/inference. Hence, there is no reduction in accuracy for that application of it compared to using conventional fixed-point MM.

pclmulqdq · on March 15, 2024

"Conventional fixed-point MM" is a large suite of algorithms. It is correct that this is a 2x reduction in MULs compared to naive fixed-point matrix multiply, but there is a large body of literature out there with other circuits. This is a cool trick to add to the group.

p1esk · on March 15, 2024

Inference world is gradually switching from INT formats to FP formats. FP8 is already supported in modern hardware, and FP4 support is coming. In my experiments I get better perplexity in language models with FP4 than with INT4.

SJC_Hacker · on March 18, 2024

How is FP fundamentally different than integers? I've done FPGA programming and it just seems like the programmer has to decide where/when to do the shifting based on the expected range of the data. I'm not sure how this is "natively supported" in hardware.

p1esk · on March 18, 2024

If you have designed FPUs you should know that FP computation involves a lot more additional operations than just shifting (e.g. rounding, subnormals, and special value handling). That’s why, for example, CPUs use different hardware blocks for INT vs FP computation.

But that’s not the point. The point is, this particular method to speed up matmul is not suitable for FP.

abetusk · on March 15, 2024

I'm no expert but I suspect this is wrong. To me, this is like saying you don't need to worry about integer overflow because your operations are only working on fixed integers. Really? You don't care if you multiply or add two large numbers and they spill over?

The more appropriate answer, I suspect, is that the numerical precision and stability sacrifices are more than adequate for normal usage.

If I'm wrong about this, I would certainly like to know.

pclmulqdq · on March 15, 2024

In hardware, you control your integer widths completely, so if you add two 32-bit ints to a 33-bit int, there is no chance of overflow. The same goes for multiplications, etc.

SJC_Hacker · on March 18, 2024

Yeah with shifts you can guarantee no overflow, but you have to decide under what circumstances is avoiding overflow/clipping worth the loss of precision.

Fixed point typically requires alot more programming, but sometimes its worth it if you know what the data ranges are.

abetusk · on March 15, 2024

I don't know why this answer is getting downvoted. This is absolutely correct.

W. Miller has a paper discussing, under conditions of numerical stability, O(n^3) multiplications is necessary [0]. Any algorithm that gets sub cubic runtime for matrix multiplication, like Strassen's or Coppersmith's, must sacrifice some amount of precision or stability.

[0] https://epubs.siam.org/doi/10.1137/0204009

gyrovagueGeist · on March 15, 2024

Another relevant paper is: https://epubs.siam.org/doi/10.1137/15M1032168

waynecochran · on March 15, 2024

The document said it outputs the exact same values as the conventional method. There is no accuracy trade off here.

pclmulqdq · on March 15, 2024

The paper cited is about hardware, where there is no accuracy tradeoff because you control the numerical precision completely and use fixed point. In a software implementation, neither is true. There is no chance that you will get the exact same values out of this method that you do out of other FP matmuls.

p1esk · on March 15, 2024

For floating point? Are you sure?

waynecochran · on March 15, 2024

Opening statement of README

    This repository contains the source code for ML hardware architectures that 
    require nearly half the number of multiplier units to achieve the same 
    performance, by executing alternative inner-product algorithms that trade 
    nearly half the multiplications for cheap low-bitwidth additions, while still 
    producing identical output as the conventional inner product.

p1esk · on March 15, 2024

I just looked at the paper: the answer is no, floating point is not supported.