It's still not trivial even with massive parallelism. Performance is easily killed if:
1) You completely saturate the memory interconnect between the processors and the RAM chips
2) Your parallel collector threads touch the same memory as your worker threads, causing cache contention
3) Cache lines containing writes are shared between processors. For example, create two ref cells and have two parallel threads set them -- if the compiler did not happen to pad them out to cache-line size, the threads will be fighting over the same 64 bytes (8 64-bit words/pointers) of memory
We (Manticore) have done a lot of work to isolate the per-CPU work, heap pages, etc. (http://arxiv.org/abs/1105.2554 , MSPC 2011 proceedings should be out on acm.org soon-ish). But it's not easy and requires rearchitecting your whole compiler and runtime system. GHC is also in the middle of some similar work, and they have a much harder problem, both because of the challenges of implementing laziness and because they have a full-featured system and a real user base, so they can't just change things willy-nilly.
Further, once you get the parallelism right, you end up in a different situation: Amdahl's Law bites. Even if 90% of your program is amenable to parallel work, and you make that portion go infinitely fast, the sequential portion still takes 10% of the original time and therefore limits you to a maximum of 10x speedup.
We are hitting that limitation right now in our research and going back to do more sequential performance work. I've got a small army of undergraduates implementing compiler optimizations and analyses :-)