After primarily using AVX2, I don't think masked instructions and scatter/gather are particularly useful. Emulating masked computations with a blend is cheap. What is expensive is emulating compress and some of the missing shuffles. Masked loads and stores don't really help with anything except for the edge case where the masked-out part would otherwise cause a page fault.
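To make the blend point concrete, here is a minimal sketch of a "masked add" on AVX2: compute the result for every lane, then use a blend to keep the original value wherever the mask is off. The function name `masked_add_avx2` and the `a[i] > 0` condition are made up for illustration, and the `target` attribute assumes GCC or Clang on x86-64.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: for 8 int32 lanes,
 *   out[i] = (a[i] > 0) ? a[i] + b[i] : a[i]
 * With AVX-512 this would be a single masked add; on AVX2 it is
 * an unconditional add followed by a blend. */
__attribute__((target("avx2")))
void masked_add_avx2(const int32_t *a, const int32_t *b, int32_t *out)
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);

    /* Lane mask: all ones where a[i] > 0, all zeros elsewhere. */
    __m256i mask = _mm256_cmpgt_epi32(va, _mm256_setzero_si256());

    /* Do the add on every lane, masked or not. */
    __m256i sum = _mm256_add_epi32(va, vb);

    /* Take sum where the mask is set, the untouched va otherwise. */
    __m256i res = _mm256_blendv_epi8(va, sum, mask);

    _mm256_storeu_si256((__m256i *)out, res);
}
```

The extra cost over a native masked instruction is the one blend. That only works when computing the masked-out lanes has no side effects, which is where the page fault edge case for loads and stores comes in.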