One interesting option for these big memory scans on x86 and ARM CPUs is the non-temporal load/store instructions. Those actually bypass caching* and may help with the cache pressure of LLM workloads that do nothing but scan memory. Even with this sort of thing, the lookup table is still probably the wrong solution.
* Not quite all of it: there are still buffers for write combining, and some read caching on scans.
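
For anyone who hasn't used them, here is a rough sketch in C of what a non-temporal store looks like on x86. The helper name `stream_copy` is made up for illustration, and the fallback to plain `memcpy` on non-SSE2 targets is my own assumption so the snippet builds anywhere. It only shows the store side: as far as I know, the streaming-load hint mostly matters for write-combining memory, and on AArch64 the rough equivalents are the LDNP/STNP instructions, which are not shown here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#endif

/* Hypothetical helper: copy n bytes from src to dst.
 * With SSE2 and a 16-byte-aligned dst, full 16-byte chunks are written with
 * non-temporal (streaming) stores; everything else falls back to memcpy. */
static void stream_copy(void *dst, const void *src, size_t n)
{
#if defined(__SSE2__)
    uint8_t *d = dst;
    const uint8_t *s = src;
    size_t i = 0;

    /* _mm_stream_si128 requires a 16-byte-aligned destination. */
    if (((uintptr_t)d & 15) == 0) {
        for (; i + 16 <= n; i += 16) {
            /* Ordinary (cached) unaligned load from the source... */
            __m128i v = _mm_loadu_si128((const __m128i *)(s + i));
            /* ...followed by a store that carries the non-temporal hint. */
            _mm_stream_si128((__m128i *)(d + i), v);
        }
        /* Streaming stores are weakly ordered; fence before the data is
         * relied on by other code. */
        _mm_sfence();
    }
    /* Tail bytes, or the whole buffer if dst was not aligned. */
    memcpy(d + i, s + i, n - i);
#else
    /* Assumption: no SSE2 available, so just do a normal copy. */
    memcpy(dst, src, n);
#endif
}
```

Calling `stream_copy(dst, src, n)` gives the same result as `memcpy`; the only intended difference is that the destination lines are not pulled into cache on the way out. Whether that actually helps depends on the workload, so measure before committing to it.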