That's fine, but a work-stealing scheduler doesn't redistribute work willy-nilly. Locally submitted tasks are likely to remain local, and are generally only stolen when stealing actually pays off. If everything is more or less evenly distributed, you'll get little or no stealing.
That's not to say it's perfect. The problem is in anticipating how much workload is about to arrive and deciding how many worker threads to spawn. If you overestimate and have too many worker threads running, you will get wasteful stealing; if you're overly conservative and slow to respond to growing workload (to avoid over-stealing), you'll wait for threads to spawn and hurt your latencies just as the workload begins to spike.
There are secondary costs, though: because a task might run on any thread, you have to sprinkle atomics and/or mutexes all over the place (in Rust parlance, the spawned tasks must be Send), and those have all sorts of implicit performance costs that stack up even if the task is never actually transferred.
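A minimal sketch of what that looks like, assuming tokio as the work-stealing runtime (the specific runtime and the counter are my illustration, not something from the comment above): because the spawned future may be moved to another worker, it must be Send, so shared state ends up behind Arc and a Mutex (or atomics) whether or not a steal ever happens.

    // Sketch only: assumes the tokio multi-threaded runtime.
    use std::sync::{Arc, Mutex};

    #[tokio::main]
    async fn main() {
        // An Rc<RefCell<u64>> would be rejected here: tokio::spawn requires
        // the future to be Send + 'static, so the thread-safe version it is.
        let counter = Arc::new(Mutex::new(0u64));

        let handles: Vec<_> = (0..4)
            .map(|_| {
                let counter = Arc::clone(&counter);
                tokio::spawn(async move {
                    // Every touch of shared state pays for the lock,
                    // even if this task never leaves its original worker.
                    *counter.lock().unwrap() += 1;
                })
            })
            .collect();

        for h in handles {
            h.await.unwrap();
        }
        println!("{}", *counter.lock().unwrap());
    }

The compile-time Send bound is the "sprinkling": you pay for thread-safe types up front, independent of whether any task is ever stolen.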
In other words, you could probably easily do 10M op/s per core on a thread-per-core design but struggle to get 1M op/s on a work-stealing design. And that work-stealing figure is roughly the total throughput for the machine, whereas the 10M op/s per core design will generally keep scaling with the number of CPUs.
An occasional successful CAS (on an owned cache line) has very little cost, but if you have to sprinkle atomics/mutexes all over the place, then there's something that's clearly not scalable in your design regardless of the concurrency implementation (you're expecting contention in a lot of places).
An atomic add on a 6 GHz high-end desktop CPU (13900) is, I believe, on the order of 4-10 ns. If it's in your hot path, the hot path can't go faster than roughly 100-250 million operations/s; that's the cost of one such instruction in your hot path (down from the ~24 billion non-atomic additions that 6 GHz core could do otherwise). A CAS brings this down to ~20-50M ops/s. So it's quite a meaningful slowdown if you actually want to use the full throughput of your CPU.

And if that cache line is cached on another CPU, you pay an additional hidden latency that could be anywhere from 40-200 ns, further reducing your hot path to a maximum of 5-25M ops/s (and that ignores the secondary effect of slowing down those other cores without them even doing anything). God forbid there's any contention: you're looking at a variance of 20x between the best and worst case in how much throughput you lose from a single CAS in your hot loop.

And this is just the task scheduler. At least in Rust, you'll also need thread-safe data structures accessed within the task itself; that's what I was referring to as "sprinkled". If you really want to target something running at 10M ops/s on a single core, I don't think you can possibly get there with a work-stealing approach.
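A rough single-threaded sketch of that per-operation gap, if you want to see it on your own machine (uncontended and with the cache line owned by this core, so this is the best case; the iteration count and the numbers you get are machine-dependent, not claims from the comment above):

    // Compare a plain add, an atomic add, and a CAS in a tight loop.
    use std::hint::black_box;
    use std::sync::atomic::{AtomicU64, Ordering};
    use std::time::Instant;

    fn main() {
        const N: u64 = 100_000_000;

        // Plain (non-atomic) add.
        let mut plain = 0u64;
        let t = Instant::now();
        for _ in 0..N {
            plain = black_box(plain.wrapping_add(1));
        }
        println!("plain add:  {:.2} ns/op", t.elapsed().as_nanos() as f64 / N as f64);

        // Atomic add (lock xadd on x86).
        let atomic = AtomicU64::new(0);
        let t = Instant::now();
        for _ in 0..N {
            black_box(atomic.fetch_add(1, Ordering::Relaxed));
        }
        println!("atomic add: {:.2} ns/op", t.elapsed().as_nanos() as f64 / N as f64);

        // Compare-and-swap, still uncontended.
        let t = Instant::now();
        for _ in 0..N {
            let cur = atomic.load(Ordering::Relaxed);
            let _ = black_box(atomic.compare_exchange(
                cur,
                cur + 1,
                Ordering::Relaxed,
                Ordering::Relaxed,
            ));
        }
        println!("CAS:        {:.2} ns/op", t.elapsed().as_nanos() as f64 / N as f64);
    }

Run it in release mode; the cross-core latency and contention penalties described above are on top of whatever this shows.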
These aren't the task queues being discussed here. It's more like rayon: I have a par_iter and I want it to go as fast as possible over a large number of elements. That's a slightly different use case from thread-per-core vs. a work-stealing runtime.
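For concreteness, this is the shape of workload being contrasted (the actual computation here is an arbitrary stand-in): a data-parallel par_iter over a large slice, not a long-lived queue of independent tasks.

    use rayon::prelude::*;

    fn main() {
        let data: Vec<u64> = (0..10_000_000).collect();

        // Work stealing shines here: rayon splits the range into chunks and
        // idle workers steal chunks, so the whole machine finishes the batch
        // at roughly the same time.
        let sum: u64 = data.par_iter().map(|x| x.wrapping_mul(2_654_435_761)).sum();

        println!("{sum}");
    }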