Based on my understanding from the description, it is ~8x faster (250 GFLOPS) fo...

jandrese · on Sept 13, 2024

That's alright, but not mindblowing. How does it compare to doing the same work on a GPU? Is there a particular set of tasks that GPUs struggle with that would be well suited for this? Or is this more a fig leaf over lousy GPU compute support in Apple land?

huijzer · on Sept 13, 2024

60 times faster could mean 2 minutes instead of 2 hours, or 2 seconds instead of 2 minutes. How is that not mind blowing, or at least very useful (for specific uses)?
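To make that arithmetic concrete, here is a minimal sketch. The helper name `time_after_speedup` is made up for illustration, and the 60x factor is simply the figure quoted in this comment, not a measured result.

```python
def time_after_speedup(baseline_seconds: float, speedup: float) -> float:
    """Return the runtime after applying a uniform speedup factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return baseline_seconds / speedup


TWO_HOURS = 2 * 60 * 60   # 7200 seconds
TWO_MINUTES = 2 * 60      # 120 seconds

# A 60x speedup takes 2 hours down to 2 minutes ...
print(time_after_speedup(TWO_HOURS, 60))    # 120.0 seconds
# ... and 2 minutes down to 2 seconds.
print(time_after_speedup(TWO_MINUTES, 60))  # 2.0 seconds
```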

jandrese · on Sept 13, 2024

Compared to 600 or 6000 times faster on a GPU though?

astrange · on Sept 14, 2024

M* CPUs aren't made for maximum performance, but for maximum power/performance tradeoffs, since they're mostly used in portables.

bee_rider · on Sept 13, 2024

Apple should mostly care about power-efficient inference I think, right? Not training. Spinning up a GPU seems like something to avoid.

I mean, I wonder how this thing compares to a gemm using all the cores in a CPU cluster. They might be OK with not even meeting that performance, if the accelerator can avoid hogging all the cores and power.

At least that’s what my uninformed gut says. The workload for these things is like: little AI enhancements inside conventional apps, I think.

lxgr · on Sept 13, 2024

> Spinning up a GPU seems like something to avoid.

You can do inference on GPUs as well, and for anything other than very small/lightweight models, such as noise cancellation or maybe speech recognition, it's probably worth the initial overhead.

I believe CoreML already splits workloads between CPU, NPU, and GPU as appropriate.

jhugo · on Sept 14, 2024

It’s likely not worth the additional energy usage though, at least when running on battery.

bee_rider · on Sept 14, 2024

Yeah, this is what I was getting at. In some sense, the list of “capabilities which don’t require spinning up the GPU” is expanded. Whether something could be done by spinning up the GPU is beside the point.

TinkersW · on Sept 13, 2024

What? The article says this thing does 2005 GFLOPS, aka 2 TFLOPS, which is decent, but we have had CPUs that could do more than this for a long time now. My 12-core Zen 2 does about 3 TFLOPS, and a modern 16-core Zen 5 can do 8-10 TFLOPS (I'm unsure what clock speed it can maintain with all cores engaged). And that is general-purpose SIMD, not specialized matrix stuff (which is less generally useful).

Apple CPUs kinda suck at vector ops, but they aren't that bad; this thing is only mildly better. I would guess power savings are a big part of why they use this SVE streaming matrix mode.
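For anyone who wants to sanity-check numbers like these, theoretical peak SIMD throughput is usually estimated as cores x clock x SIMD lanes x FMA units x 2 (one multiply plus one add per FMA). A rough sketch follows; the function name and every input value are hypothetical placeholders, not specs for any chip mentioned here, and sustained throughput is normally lower than this peak.

```python
def peak_gflops(cores: int, clock_ghz: float, simd_lanes: int, fma_units: int) -> float:
    """Estimate theoretical peak GFLOPS for a CPU.

    Each FMA unit retires one fused multiply-add per lane per cycle,
    which counts as 2 floating-point operations.
    """
    flops_per_cycle_per_core = simd_lanes * fma_units * 2
    return cores * clock_ghz * flops_per_cycle_per_core


# Hypothetical CPU: 8 cores at 3.0 GHz, 8 FP32 lanes per vector, 2 FMA units per core.
estimate = peak_gflops(cores=8, clock_ghz=3.0, simd_lanes=8, fma_units=2)
print(estimate)  # 768.0

# Compare against the 2005 GFLOPS figure quoted from the article.
print(estimate / 2005)
```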

mmoskal · on Sept 14, 2024

IIUC this is the CPU in an iPad. The Pro/Max versions would be more appropriate to compare against the Zen when they are released.