That's fine, but a work-stealing scheduler doesn't redistribute work willy-nilly. Locally submitted tasks are likely to remain local, and are generally only stolen when stealing actually pays off. If everything is more or less evenly distributed, you'll get little or no stealing.
That's not to say it's perfect. The problem is in anticipating how much workload is about to arrive and deciding how many worker threads to spawn. If you overestimate and have too many worker threads running, you will get wasteful stealing; if you're overly conservative and slow to respond to growing workload (to avoid over-stealing), you'll wait for threads to spawn and hurt your latencies just as the workload begins to spike.
There are secondary costs, though: because a task might run on any thread, you have to sprinkle atomics and/or mutexes all over the place (in Rust parlance, the spawned tasks must be Send), and those have all sorts of implicit performance costs that stack up even if the task is never actually transferred.
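A minimal sketch of what that looks like, assuming tokio as the work-stealing runtime (the specific runtime and the counter are my illustration, not something from the comment above): because the spawned future may be moved to another worker, it must be Send, so shared state ends up behind Arc and a Mutex (or atomics) whether or not a steal ever happens.

    // Sketch only: assumes the tokio multi-threaded runtime.
    use std::sync::{Arc, Mutex};

    #[tokio::main]
    async fn main() {
        // An Rc<RefCell<u64>> would be rejected here: tokio::spawn requires
        // the future to be Send + 'static, so the thread-safe version it is.
        let counter = Arc::new(Mutex::new(0u64));

        let handles: Vec<_> = (0..4)
            .map(|_| {
                let counter = Arc::clone(&counter);
                tokio::spawn(async move {
                    // Every touch of shared state pays for the lock,
                    // even if this task never leaves its original worker.
                    *counter.lock().unwrap() += 1;
                })
            })
            .collect();

        for h in handles {
            h.await.unwrap();
        }
        println!("{}", *counter.lock().unwrap());
    }

The compile-time Send bound is the "sprinkling": you pay for thread-safe types up front, independent of whether any task is ever stolen.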
In other words, you could probably easily do 10M op/s per core on a thread-per-core design but struggle to get 1M op/s on a work-stealing design. And that work-stealing figure is roughly the total throughput for the machine, whereas the 10M op/s per core design will generally keep scaling with the number of CPUs.
An occasional successful CAS (on an owned cache line) has very little cost, but if you have to sprinkle atomics/mutexes all over the place, then there's something that's clearly not scalable in your design regardless of the concurrency implementation (you're expecting contention in a lot of places).
An atomic add on a 6 GHz high-end desktop CPU (13900) is, I believe, on the order of 4-10 ns. If it's in your hot path, the hot path can't go faster than roughly 100-250 million operations/s; that's the cost of one such instruction in your hot path (down from the ~24 billion non-atomic additions that 6 GHz core could do otherwise). A CAS brings this down to ~20-50M ops/s. So it's quite a meaningful slowdown if you actually want to use the full throughput of your CPU.

And if that cache line is cached on another CPU, you pay an additional hidden latency that could be anywhere from 40-200 ns, further reducing your hot path to a maximum of 5-25M ops/s (and that ignores the secondary effect of slowing down those other cores without them even doing anything). God forbid there's any contention: you're looking at a variance of 20x between the best and worst case in how much throughput you lose from a single CAS in your hot loop.

And this is just the task scheduler. At least in Rust, you'll also need thread-safe data structures accessed within the task itself; that's what I was referring to as "sprinkled". If you really want to target something running at 10M ops/s on a single core, I don't think you can possibly get there with a work-stealing approach.
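A rough single-threaded sketch of that per-operation gap, if you want to see it on your own machine (uncontended and with the cache line owned by this core, so this is the best case; the iteration count and the numbers you get are machine-dependent, not claims from the comment above):

    // Compare a plain add, an atomic add, and a CAS in a tight loop.
    use std::hint::black_box;
    use std::sync::atomic::{AtomicU64, Ordering};
    use std::time::Instant;

    fn main() {
        const N: u64 = 100_000_000;

        // Plain (non-atomic) add.
        let mut plain = 0u64;
        let t = Instant::now();
        for _ in 0..N {
            plain = black_box(plain.wrapping_add(1));
        }
        println!("plain add:  {:.2} ns/op", t.elapsed().as_nanos() as f64 / N as f64);

        // Atomic add (lock xadd on x86).
        let atomic = AtomicU64::new(0);
        let t = Instant::now();
        for _ in 0..N {
            black_box(atomic.fetch_add(1, Ordering::Relaxed));
        }
        println!("atomic add: {:.2} ns/op", t.elapsed().as_nanos() as f64 / N as f64);

        // Compare-and-swap, still uncontended.
        let t = Instant::now();
        for _ in 0..N {
            let cur = atomic.load(Ordering::Relaxed);
            let _ = black_box(atomic.compare_exchange(
                cur,
                cur + 1,
                Ordering::Relaxed,
                Ordering::Relaxed,
            ));
        }
        println!("CAS:        {:.2} ns/op", t.elapsed().as_nanos() as f64 / N as f64);
    }

Run it in release mode; the cross-core latency and contention penalties described above are on top of whatever this shows.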
These aren't the task queues being discussed here. It's more like rayon: I have a par_iter and I want it to go as fast as possible over a large number of elements. That's a slightly different use case from thread-per-core vs. a work-stealing runtime.
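For concreteness, this is the shape of workload being contrasted (the actual computation here is an arbitrary stand-in): a data-parallel par_iter over a large slice, not a long-lived queue of independent tasks.

    use rayon::prelude::*;

    fn main() {
        let data: Vec<u64> = (0..10_000_000).collect();

        // Work stealing shines here: rayon splits the range into chunks and
        // idle workers steal chunks, so the whole machine finishes the batch
        // at roughly the same time.
        let sum: u64 = data.par_iter().map(|x| x.wrapping_mul(2_654_435_761)).sum();

        println!("{sum}");
    }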