I wonder if, instead of just predicting the next n tokens, it could also predict, say, 128, 512, or 2048 tokens ahead, and thus learn long-term discourse structure.
Might be good to have some flexibility in where those particular tokens are placed, but yeah -- I could see value in creating a "pool" of tokens that should be used somewhere later in the answer.
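To make the idea concrete, here's a toy data-side sketch of what the training targets might look like: for each position, an exact token at a few fixed long-range horizons, plus an order-free "pool" of tokens that should appear somewhere in a future window. The horizon values and window size are placeholders (scaled down from 128/512/2048 for illustration), and `build_targets` is a hypothetical helper, not anything from an existing library.

```python
from collections import Counter

# Hypothetical toy horizons; stand-ins for the 128/512/2048 mentioned above.
HORIZONS = [4, 16, 64]

def build_targets(tokens, horizons=HORIZONS, pool_window=16):
    """For each position t, collect:
      - skip_targets: the exact token at each fixed horizon t+h
        (long-range prediction, position is fixed), and
      - pool: an order-free multiset of tokens occurring in the
        next pool_window positions (position is flexible, per the
        "pool" idea -- the model only commits to *using* them later).
    """
    examples = []
    for t in range(len(tokens)):
        skip = {h: tokens[t + h] for h in horizons if t + h < len(tokens)}
        pool = Counter(tokens[t + 1 : t + 1 + pool_window])
        examples.append({"pos": t, "skip_targets": skip, "pool": pool})
    return examples

toks = list(range(100))
ex = build_targets(toks)
# At t=0 all three horizons fit; near the end, long horizons drop out,
# which is one reason flexible "pool" targets may be easier to supervise.
```

A real setup would turn `skip_targets` into extra prediction heads with a cross-entropy loss per horizon, and `pool` into a multi-label (bag-of-tokens) loss, but the target construction above is the core of both ideas.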