
Based on my understanding from the description, it is ~8x faster (250 GFLOPS) for vector ops (vs. SVE mode at 31 GFLOPS which is CPU-ish) and 60-100 times faster (e.g. 2005 GFLOPS) for matrix multiplication for single-precision values.
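As a rough check of those ratios, the arithmetic can be sketched in Python using only the three throughput figures quoted above. The variable names are made up for illustration, and the numbers are the comment's, not independent measurements.

```python
# Throughput figures quoted in the comment above, in GFLOPS (single precision).
# Variable names are illustrative only.
sve_gflops = 31.0            # plain SVE mode ("CPU-ish")
fast_vector_gflops = 250.0   # vector ops in the faster mode
fast_matrix_gflops = 2005.0  # matrix multiplication in the faster mode

# Speedup is just the ratio of the two throughputs.
vector_speedup = fast_vector_gflops / sve_gflops
matrix_speedup = fast_matrix_gflops / sve_gflops

print(f"vector: ~{vector_speedup:.1f}x, matrix: ~{matrix_speedup:.1f}x")
```

With these inputs the vector ratio comes out at about 8x and the matrix ratio lands in the quoted 60-100x range.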


That's alright, but not mindblowing. How does it compare to doing the same work on a GPU? Is there a particular set of tasks that GPUs struggle with that would be well suited for this? Or is this more a fig leaf over lousy GPU compute support in Apple land?


60 times faster could mean 2 minutes instead of 2 hours, or 2 seconds instead of 2 minutes. How is that not mind blowing, or at least very useful (for specific uses)?


Compared to 600 or 6000 times faster on a GPU though?


M* CPUs aren't made for maximum performance, but for maximum power/performance tradeoffs, since they're mostly used in portables.


Apple should mostly care about power-efficient inference I think, right? Not training. Spinning up a GPU seems like something to avoid.

I mean, I wonder how this thing compares to a GEMM using all the cores in a CPU cluster. They might be OK with not even meeting that performance, as long as the accelerator doesn't hog all the cores and power.

At least that’s what my uninformed gut says. The workload for these things is like: little AI enhancements inside conventional apps, I think.


> Spinning up a GPU seems like something to avoid.

You can do inference on GPUs as well, and for anything other than very small/lightweight models, such as noise cancellation or maybe speech recognition, it's probably worth the initial overhead.

I believe CoreML already splits workloads between CPU, NPU, and GPU as appropriate.


It’s likely not worth the additional energy usage though, at least when running on battery.


Yeah, this is what I was getting at. In some sense, the list of “capabilities which don’t require spinning up the GPU” is expanded. Whether something could be done by spinning up the GPU is beside the point.


What? The article says this thing does 2005 GFLOPS, aka 2 TFLOPS, which is decent, but we have had CPUs that could do more than this for a long time now. My 12-core Zen 2 does about 3 TFLOPS, and a modern 16-core Zen 5 can do 8-10 TFLOPS (I'm unsure what clock speed it can maintain with all cores engaged). And that is general-purpose SIMD, not specialized matrix stuff (which is less generally useful).

Apple CPUs kinda suck at vector ops, but they aren't that bad; this thing is only mildly better. I would guess power savings is a big part of why they use this SVE streaming matrix mode.
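
For anyone wanting to sanity-check theoretical peak numbers like the ones above, the usual back-of-envelope formula is cores × sustained clock × FLOPs per cycle per core. A minimal sketch follows; the function name and the example inputs are placeholders chosen for illustration, not specs of any chip mentioned in this thread.

```python
def peak_gflops(cores: int, clock_ghz: float, flops_per_cycle_per_core: int) -> float:
    """Theoretical peak throughput in GFLOPS.

    flops_per_cycle_per_core depends on SIMD width, the number of FMA units,
    and precision; it has to be looked up for each microarchitecture.
    """
    return cores * clock_ghz * flops_per_cycle_per_core


# Placeholder inputs, NOT the specs of any real CPU discussed here.
example = peak_gflops(cores=8, clock_ghz=3.0, flops_per_cycle_per_core=16)
print(f"{example:.0f} GFLOPS")
```

The result is an upper bound: the all-core clock under sustained vector load is often lower than the advertised boost clock, which is the uncertainty noted in the parenthetical above.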


IIUC this is the CPU in an iPad. The Pro/Max versions would be more appropriate to compare against the Zen when they are released.



