Based on my understanding from the description, it is ~8x faster (250 GFLOPS) for vector ops (vs. SVE mode at 31 GFLOPS which is CPU-ish) and 60-100 times faster (e.g. 2005 GFLOPS) for matrix multiplication for single-precision values.
That's alright, but not mindblowing. How does it compare to doing the same work on a GPU? Is there a particular set of tasks that GPUs struggle with that would be well suited for this? Or is this more a fig leaf over lousy GPU compute support in Apple land?
60 times faster could mean 2 minutes instead of 2 hours, or 2 seconds instead of 2 minutes. How is that not mind blowing, or at least very useful (for specific uses)?
Apple should mostly care about power-efficient inference I think, right? Not training. Spinning up a GPU seems like something to avoid.
I mean, I wonder how this thing compares to a gemm using all the cores in a cpu cluster. They might be ok with not even meeting that performance, if the accelerator can not hog all the cores and power.
At least that’s what my uninformed gut says. The workload for these things is like: little AI enhancements inside conventional apps, I think.
> Spinning up a GPU seems like something to avoid.
You can do inference on GPUs as well, and for anything other than very small/lightweight models, such as noise cancellation or maybe speech recognition, it's probably worth the initial overhead.
I believe CoreML already splits workloads between CPU, NPU, and GPU as appropriate.
Yeah, this is what I was getting at. In some sense, the list of “capabilities which don’t require spinning up the GPU” is expanded. Whether something could be done by spinning up the GPU is beside the point.
What? The article says this thing does 2005 GFOPs, aka 2 TFLOPS, which is decent, but we have had CPUs that could do more than this for a long time now. My Zen2 12 core does about 3 TFLOPs, and a modern 16 core Zen5 can do 8-10 TFLOPS(I'm unsure what clock speed it can maintain with all cores engaged). And that is generally purpose SIMD not specialized matrix stuff(less generally useful).
Apple CPU's kinda suck at vector ops, but they aren't that bad, this thing is only mildly better. I would guess power savings is a big part of why they use this SVE streaming matrix mode.