In the two cases you have specified here, #2 is almost always the winner on performance, so much so that in performance-sensitive code, many people will (justifiably) default to it without a benchmark. Computers almost always operate in a memory-bandwidth-bound state with comparatively idle cores, so #1 is likely to waste the very resource that will almost always be the binding constraint.

Examples are ECS architectures in games and async run-to-completion runtimes on servers. HPC systems also tend to operate this way.

Also, in the interest of disclosure, I wrote the blog post you are responding to.